
Mining the 'Internet Graveyard': Rethinking the Historians' Toolkit

IAN MILLIGAN

Abstract

"Mining the Internet Graveyard" argues that the advent of massive quantities of born-digital historical sources necessitates a rethinking of the historians' toolkit. The contours of a third wave of computational history are outlined, a trend marked by ever-increasing amounts of digitized information (especially web based), falling digital storage costs, a move to the cloud, and a corresponding increase in computational power to process these sources. Following this, the article uses a case study of an early born-digital archive at Library and Archives Canada – Canada's Digital Collections project (CDC) – to bring some of these problems into view. An array of off-the-shelf data analysis solutions, coupled with code written in Mathematica, helps us bring context and retrieve information from a digital collection on a previously inaccessible scale. The article concludes with an illustration of the various computational tools available, as well as a call for greater digital literacy in history curricula and professional development.

Résumé

Dans cet article, l'auteur soutient que la production d'une grande quantité de sources historiques numériques nécessite une réévaluation du coffre à outils des historiennes et des historiens. Une troisième vague d'histoire informatique, marquée par un nombre toujours croissant d'informations numérisées (surtout dans le cadre d'Internet), la chute des coûts d'entreposage des données numériques, le développement des nuages informatiques et l'augmentation parallèle de la capacité d'utiliser ces sources, bouleverse déjà la pratique historienne. Cet article se veut une étude de cas basée sur ces observations. Il étudie plus particulièrement un projet de numérisation de Bibliothèque et Archives Canada – les Collections numérisées du Canada – pour éclairer certains défis ci-haut mentionnés. Un ensemble de solutions prêtes à utiliser pour l'analyse de données, jumelé avec un code informatique écrit en Mathematica, peut contribuer à retracer le contexte et à retrouver des informations à partir d'une collection numérique précédemment inaccessible aux chercheurs étant donné sa taille. L'article se termine par une présentation des différents outils informatiques accessibles aux historiens ainsi que par un appel à l'acquisition d'une plus grande culture numérique dans les curriculums en histoire et dans le développement professionnel des historiens.

My sincerest thanks to William Turkel, who sent me down the programming rabbit hole. Thanks also to Jennifer Bleakney, Thomas Peace, and the anonymous peer reviewers for their suggestions and assistance. Financial support was generously provided by the Social Sciences and Humanities Research Council of Canada.

The information produced and consumed by humankind used to vanish – that was the norm, the default. The sights, the sounds, the songs, the spoken word just melted away. Marks on stone, parchment, and paper were the special case. It did not occur to Sophocles' audiences that it would be sad for his plays to be lost; they enjoyed the show. Now expectations have inverted. Everything may be recorded and preserved, at least potentially.

James Gleick, The Information: A History, a Theory, a Flood, 396–7

Almost every day, almost every Canadian generates born-digital information that is fast becoming a sea of data. Historians must learn how to navigate this sea of digital material. Where will the historian of the future go to research today's history? What types of sources will they use? We are currently experiencing a revolutionary medium shift. As the price of digital storage plummets and communications are increasingly carried through digitized text, pictures, and video, the primary sources left for future historians will require a significantly expanded and rethought toolkit. The need to manage a plethora of born-digital sources (those originally digitally created) as an essential part of history will be on us soon. While there is no commonly accepted rule-of-thumb for when a topic becomes "history," it is worth noting as an example that it took less than 30 years after the tumultuous year of 1968 for a varied, developed, and contentious North American historiography to appear on the topic of life in the 1960s.1 In 2021, it will have been a similar 30 years since, in August 1991, Tim Berners-Lee, a fellow at the European Organization for Nuclear Research (CERN), published the very first website and launched the World Wide Web. Professional historians need to be ready to add and develop new skills to deal with sources born in a digital age.

Digital sources necessitate a rethinking of the historian's toolkit. Basic training for new historians requires familiarity with new methodological tools, and making resources for the acquisition of these tools available to their mentors and teachers. Not only will these tools help in dealing with more recent born-digital sources, but they will also help historians capitalize on digitized archival sources from our more-distant past. This form of history will not replace earlier methodologies, but instead play a meaningful, collaborative, and supportive role. It will be a form of distant reading, encompassing thousands or tens of thousands of sources, that will complement the more traditional and critically important close reading of small batches of sources that characterizes so much of our work. Indeed, much of what we think of as "digital history" may simply be "history" in the years to come.2

There are a number of avenues along which this transformation will take place. First, it is important to establish a map that situates this emerging field in what has come before. Second, I examine the current situation of born-digital sources and historical practice. Third, I use an early born-digital archive from Library and Archives Canada (LAC), Canada's Digital Collections project (CDC), to evaluate how we access and make sense of digital information today. I then apply emergent visualization techniques to the sources in this archive to show one way through which historians can tackle born-digital source issues. Lastly, I will lay out a road map to ensure that the next generation of historians has enough basic digital literacy to adequately analyze this new form of source material.

Imagine a future historian taking on a central question of social and cultural life in the early twenty-first century, such as how Canadians understood Idle No More or the Occupy movement through social media. What would her archive look like? I like to imagine it as boxes stretching out into the distance, tapering off without any immediate reference points to bring it into perspective. Many of these digital archives will be without human-generated finding aids, would have perhaps no dividers or headings within "files," and certainly no archivist with a comprehensive grasp of the collection contents. A recent example is illustrative. During the IdleNoMore protests, Twitter witnessed an astounding 55,334 tweets on 11 January 2013. Each tweet can be up to 140 characters. To put that into perspective, this article is less than 900 lines long in a word processor. Yet this massive amount of information looks cryptic: strange usernames, no archives, no folders, not even an immediate way to see whether a given tweet is relevant to your research or not.3 This is a vast collection. Information overload.

This example is indicative of the issues facing the historical profession.4 I am not exaggerating. Every day, half a billion tweets are sent,5 hundreds of thousands of comments are uploaded to news sites, and scores of blog posts are uploaded. This makes up one small aspect of broader data sets: automated logs, including climate readings, security access reports, library activity, search patterns, books and movies ordered from Amazon.ca or Netflix.6 Even if only a small fraction of this is available to future historians, it will represent an unparalleled source of information about how life was lived (or thought to have been lived) during the early twenty-first century. To provide another, perhaps more tangible, example: while carrying out my previous work on youth cultures in 1960s English-Canada, I was struck by how inaccessible television sources were compared to newspaper records, despite television being arguably the primary media of the time. Unlike the 1960s, however, the Internet leaves a plethora of available – if difficult to access – primary sources for historians. In the digital age, archiving will be different, which means that the history we write will be different.

We must begin planning for this plethora of born-digital sources now. Using modern technology, we have the ability to quickly find needles in digital haystacks: to look inside a "box" and immediately find that relevant letter, rather than spending hours searching and reading everything. Yet we also have the ability to pull our gaze back, distantly elucidating the context of our relevant documents as well.

The Necessity of Distant Reading: Historians and Computational History

Franco Moretti, in his 2005 groundbreaking work of literary criticism Graphs, Maps, Trees, called on his colleagues to rethink their craft. With the advent of mass-produced novels in nineteenth-century Britain, Moretti argued, scholars must change their disciplinary approach. Traditionally, literary scholars worked with a corpus of around 200 novels; though an impressive number for the average person, it represents only around one percent of the total output of that period. To work with all of this material in an effort to understand these novels as a genre, critics needed to approach research questions in a different way:

[C]lose reading won't help here, a novel a day every day of the year would take a century or so … And it's not even a matter of time, but of method: a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn't a sum of individual cases: it's a collective system, that should be grasped as such, as a whole.7

Instead of close reading, Moretti called for literary theorists to practice "distant reading."

For inspiration, Moretti drew on the work of French Annales historian Fernand Braudel. Braudel, amongst the most instrumental historians of the twentieth century, pioneered a similarly distant approach to history: his inquiries spanned large expanses of time and space, that of civilizations, the Mediterranean world, as well as smaller (only by comparison) regional histories of Italy and France. His distant approach to history did not necessitate disengagement with the human actors on the ground, seeing instead that beyond the narrow focus of individual events lay the constant ebb and flow of poverty and other endemic structural features.8 While his methodology does not apply itself well to rapidly changing societies (his focus was on long-term, slow change), his points around distant reading and the need for collaborative, interdisciplinary research are a useful antecedent. Similar opportunities present themselves when dealing with the vast array of born-digital sources (as well as digitized historical sources): we can distantly read large arrays of sources from a longue durée of time and space, mediated through algorithms and computer interfaces, opening up the ability to consider the voices of everybody preserved in the source.

Historians must be open to the digital turn, due to the astounding growth of digital sources and an increasing technical ability to process them on a mass scale. Social sciences and humanities researchers have begun to witness a profound transformation in how they conduct and disseminate their work. Indeed, historical digital sources have reached a scale where they defy conventional analysis and now require computational analysis. Responding to this, archives are increasingly committed to preserving cultural heritage materials in digital forms.

I would like to pre-emptively respond to two criticisms of this digital shift within the historical profession. First, historians will not all have to become programmers. Just as not all historians need a firm grasp of Geographical Information Systems (GIS), a developed understanding of the methodological implications of community-based oral history, the incorporation of a transnational perspective, or in-depth engagement with cutting-edge demographic models, not all historians have to approach their trade from a computational perspective. Nor should they. Digital history does not replace close reading, traditional archival inquiry, or going into communities – to use only a few examples – to uncover notions of collective memory or trauma. Indeed, digital historians will play a facilitative role and provide a broader reading context. Yet they will still be historians: collecting relevant sources, analyzing and contextualizing them, situating them in convincing narratives or explanatory frameworks, and disseminating their findings to wider audiences.

Neither will historians replace professional programmers. Instead, as with other subfields, historians using digital sources will need to be prepared to work on larger, truly interdisciplinary teams. Computer scientists, statisticians, and mathematicians, for example, have critical methodological and analytical contributions to make towards questions of historical inquiry. Historians need to forge further links with these disciplines. Such meaningful collaboration, however, will require historians who can speak their languages and who have an understanding of their work. Some of us must be able to discuss computational or algorithmic methodologies when we collectively explore the past.

The methodologies discussed in this paper fall into the twin fields of data mining and textual analysis. The former refers to the discovery of patterns or information about large amounts of information, while the latter refers to the use of digital tools to analyze large quantities of text.9 The case study here draws upon three main concepts: information retrieval (quickly finding what you want), information and term frequency (how often do various concepts and terms appear?), and topic modeling (finding frequently occurring bags of words throughout documents). Other projects have involved authorship attribution, which is finding out the author of documents based on previous patterns, style analysis, or creating network graphs of topics, groups, documents, or individuals.

The Culturomics project, run by the Cultural Observatory at Harvard University, is a notable model of collaboration on a big history project. As laid out in Science and put online as the Google Books n-gram viewer (books.google.com/ngrams), this project indexed word and phrase frequency across over five million books, enabling researchers to trace the rise and fall of cultural ideas and phenomena through targeted keyword and phrase searches.10 This is the most ambitious, and certainly the most advanced and accessible, Big History project yet launched. Yet amongst the extensive author list (13 individuals and the Google Books team), there were no historians. This suggests that other disciplines in the humanities and social sciences have embraced the digital turn far more than historians. However, as I mentioned, in the coming decades historians will increasingly rely on digital archives. There is a risk that history as a professional discipline could be left behind.

American Historical Association President Anthony Grafton noted this absence in a widely circulated Perspectives article. Why was there no historian, he asked, when this was both primarily a historical project and a team comprised of varied doctoral holders from disciplines such as English literature, psychology, computer science, biology, and mathematics?12 Culturomics project leaders Erez Lieberman Aiden and Jean-Baptiste Michel responded frankly. While the team approached historians, and some did play advisory roles, every author listed "directly contributed to either the creation of the collection of written texts (the 'corpus') or to the design and execution of the specific analyses we performed. No academic historians met this bar." They continued to indict the profession:

The historians who came to the meeting were intelligent, kind, and encouraging. But they didn't seem to have a good sense of how to wield quantitative data to answer questions, didn't have relevant computational skills, and didn't seem to have the time to dedicate to a big multi-author collaboration. It's not their fault: these things don't appear to be taught or encouraged in history departments right now.13

This is a serious indictment, considering the pending archival shift to digital sources, and one that still largely holds true (a few history departments now offer digital history courses, but they are still outliers). This form of history, the massive analysis of digitized information, will continue. Furthermore, these approaches will attract public attention, as they lend themselves well to accessible online deployment. Unless historians are well positioned, they risk ceding professional ground to other disciplines.

While historians were not involved in Culturomics, they are, however, involved in several other digital projects. Two major and ambitious long-term Canadian projects are currently using computational methods to reconstruct the demographic past. The first is the Canadian Century Research Infrastructure project, which is creating databases from five censuses and aiming to provide "a new foundation for the study of social, economic, cultural, and political change," while the second is the Programme de recherche en démographie historique, hosted by the Université de Montréal.13 They are reconstructing the European population of Quebec in the seventeenth and eighteenth centuries, drawing heavily on parish registers. Furthermore, several projects in England have harnessed the wave of interest in "big data" to tell compelling historical stories about their pasts: the Criminal Intent project, for example, used the proceedings of the Old Bailey courts (the main criminal court in Greater London) to visualize large quantities of historical information.14 The Criminal Intent project was itself funded by the Digging into Data Challenge, which brings together multinational research teams and funding from federal-level funding agencies in Canada, the United States, the United Kingdom, and beyond. Several other historical studies have been funded under their auspices, from studies of commodity trade in the eighteenth- and nineteenth-century Atlantic world (involving scholars at the University of Edinburgh, the University of St Andrews, and York University) to data mining the degrees of economic opportunity and spatial mobility in Britain, Canada, and the United States (bringing researchers at the University of Guelph, the University of Leicester, and the University of Minnesota together).15 These exciting interdisciplinary teams may be heralding a disciplinary shift, although it is still early days.

Collaborative teams have limitations, though. We will have to be prepared to tackle digital sources alone, on shoestring budgets. Historians will not always be able to call on external programmers or specialists to do work, because teams often develop out of funding grants. As governmental and university budgets often prove unable to meet consistent research needs, baseline technical knowledge is required for independent and unfunded work. If historians rely exclusively on outside help to deal with born-digital sources, the ability to do computational history diminishes. As the success of the Programming Historian indicates, and as I will demonstrate in my case study below, historians can learn basic techniques.16 Let us tweak our own algorithms.

A social or cultural history of the very late-twentieth or twenty-first century will have to account for born-digital sources, and historians will (in conjunction with archival professionals) have to curate and make sense of the data. Historians need to be able to engage with these sources themselves, or at least as part of a team, rather than having sources mediated through the lens of outside disciplines or commercial interests. Algorithms may be one meaningful way to make sense of cultural trends or extract meaningful information from billions of tweets, but we must ensure professional historical input goes into their crafting.

The Third Wave of Computational History

Historians have fruitfully used computers in the past, as part of two previous waves. The 1960s and 1970s saw computational tools used for demographic, population, and economic histories, a trend that in Canada saw full realization in comprehensive studies such as Michael Katz's The People of Hamilton, Canada West.17 As Ian Anderson has convincingly noted in a review of history and computing, studies of this sort saw the field become associated with quantitative studies.18

After a retreat from the spotlight, computational history rose again in the 1990s with the advent of the personal computer, graphical user interfaces, and overall improvements in ease of use. Punch cards could now give way to database programs, GIS, and early online networks such as H-Net and Usenet. We are now, I believe, on the cusp of a third revolution in computational history, thanks to three main factors: decreasing storage costs, the power of the Internet and distributed cloud computing, and the rise of professionals dealing with both digital preservation and open-source tools.

Decreasing storage costs have led to massive amounts of information being preserved, as Gleick noted in this article's epigraph. It is important to note that this information overload is not new. People have long worried about the impact of too much information.19 In the sixteenth century, responding to the rise of the printing press, the German priest Martin Luther decried that the "multitude of books [were] a great evil." In the nineteenth century, Edgar Allan Poe bemoaned "[t]he enormous multiplication of books in every branch of knowledge is one of the greatest evils of this age." As recently as 1970, American historian Lewis Mumford lamented: "the overproduction of books will bring about a state of intellectual enervation and depletion hardly to be distinguished from massive ignorance."20 The rise of born-digital sources must thus be seen in this continuous context of handwringing around the expansion and rise of information.

Useful historical information is being preserved at a rate that is accelerating with every passing day. One useful metric to compare dataset size is that of the American Library of Congress (LOC); indeed, a "LOC" has become shorthand for a unit of measurement in the digital humanities. Measuring the actual extent of the printed collection in terms of data is notoriously difficult: James Gleick has claimed it to be around 10TB,22 whereas the Library of Congress itself claims a more accurate figure is about 200TB.23 While these two figures are obviously divergent, neither is on an unimaginable scale – personal computers now often ship with a terabyte or more of storage. When Claude Shannon, the father of information theory, carried out a thought experiment in 1949 about items or places that might store "information" (an emerging field which viewed messages in their aggregate), he scaled sites from a digit upwards to the largest collection he could imagine. The list scaled up logarithmically from 10⁰ to 10¹⁴: if a single-spaced page of typing is 10⁴, 10⁷ the Proceedings of the Institute of Radio Engineers, and 10⁹ the entire text of the Encyclopaedia Britannica, then 10¹⁴ was "the largest information stockpile he could think of: the Library of Congress."24

But now the Library of Congress' print collection – something that has taken two centuries to gather – is dwarfed by their newest collection of archived born-digital sources, the vast majority only a matter of years old compared to the much wider date-range of non-digital sources in the traditional collection. The LOC has begun collecting its own born-digital Internet archive: multiple snapshots are taken of webpages to show how they change over time. Even as a selective, curated collection, drawing on governmental, political, educational, and creative websites, the LOC has already collected 254TB of data and adds 5TB a month.25 The Internet Archive, through its Wayback Machine, is even more ambitious: it seeks to archive every website on the Internet. While its size is also hard to put an exact finger on, as of late 2012 it had over 10 petabytes of information (or, to put it into perspective, a little over 50 Library of Congresses if we take the 200TB figure).26


This is to some degree comparing apples and oranges: the Library of Congress is predominantly print, whereas the Internet Archive has considerable multimedia holdings, which include videos, images, and audio files. The LOC is furthermore a more curated collection, whereas the Internet Archive draws from a wider range of producers. The Internet offers the advantages and disadvantages of being a more democratic archive. For example, a GeoCities site (an early way for anybody to create their own website) created by a 12-year-old Canadian in the mid-1990s might be preserved by the Internet Archive, whereas it almost certainly would not be saved by a national archive. This difference in size and collections management, however, is at the root of the changing historians' toolkit. If we make use of these large data troves, we can access a new and broad range of historical subjects.

Conventional sources are also part of this deluge. Google Books has been digitizing the printed word, and it currently has 15 million works completed; by 2020, it audaciously aims to digitize every book ever written. While copyright issues loom, the project will allow researchers to access aggregate word and phrase data, as seen in the Google n-gram project.27 In Canada, Library and Archives Canada (LAC) has a more circumscribed digitization project, collecting "a representative sample of Canadian websites," focusing particularly on Government of Canada webpages, as well as a curated approach preserving case studies such as the 2006 federal election. Currently, it has 4TB of federal government information and 7TB of total information in its larger web archive (a figure which, as I note below, is unlikely to drastically improve despite ostensible goals of archival "modernization").28

In Canada, LAC has – on the surface – been advancing an ambitious digitization program, as it realizes the challenges now facing archivists and historians. Daniel J. Caron, the Librarian and Archivist of Canada, has been outspoken on this front, with several public addresses on the question of digital sources and archives. Throughout 2012, primary source digitization at LAC preoccupied critics who saw it as a means to depreciate on-site access. The Canadian Association of University Teachers, the Canadian Historical Association, as well as other professional organizations, mounted campaigns against this shift in resources.29 LAC does have a point, however, with their modernization agenda. The state of Canada's online collections is small and sorely lacking when compared to their on-site collections, and LAC does need to modernize and achieve the laudable goal of reaching audiences beyond Ottawa.

Nonetheless, digitization has unfortunately been used as a smokescreen to depreciate overall service offerings.30 The modernization agenda would also see 50 percent of the digitization staff cut.31 LAC's online collections are small; they do not work with online developers through Application Programming Interfaces (APIs, a way for a computer program to talk to another computer and speed up research – Canadiana.ca is incidentally a leader in this area); and there have been no promising indications that this state of affairs will change. Digitization has not been comprehensive. Online preservation is important, and historians must fight for both on-site and on-line access.

A number of historians have recognized this challenge and are playing an instrumental role in preserving the born-digital past. The Roy Rosenzweig Center for History and New Media at George Mason University launched several high-profile preservation projects: the September 11th Digital Archive, the Hurricane Digital Memory Bank (focusing on Hurricanes Katrina and Rita), and, as of writing, the Occupy archive.32 Massive amounts of online content are curated and preserved: photographs, news reports, blog posts, and now tweets. These complement more traditional efforts of collecting and preserving oral histories and personal recollections, which are then geo-tagged (tied to a particular geographical location), transcribed, and placed online. These archives serve a dual purpose: preserving the past for future generations while also facilitating easy dissemination for today's audiences.

Digital preservation, however, is a topic of critical importance. Much of our early digital heritage has been lost. Many websites, especially those before the Internet Archive's 1996 web archiving project launch, are completely gone. Early storage mediums (from tape to early floppy disks) have become obsolete and are now nearly inaccessible. Compounding this, software development has seen early proprietary file formats depreciated, ignored, and eventually made wholly inaccessible.33 Fortunately, the problem of digital preservation was recognized by the 1990s, and in 2000 the American government established the National Digital Information Infrastructure and Preservation Program.34 As early as 1998, digital preservationists argued: "[h]istorians will look back on this era … and see a period of very little information. A 'digital gap' will span from the beginning of the wide-spread use of the computer until the time we eventually solve this problem. What we're all trying to do is to shorten that gap." To bear home their successes, that quotation is taken from an Internet Archive snapshot of Wired magazine's website.35

Indeed, greater attention towards rigorous metadata (which is, briefly put, data about data, often preserved in plain text format), open-access file formats documented by the International Organization for Standardization (ISO), and dedicated digital preservation librarians and archivists gives us hope for the future.36 It is my belief that cloud storage, the online hosting of data in large third-party offsite centres, mitigates the issue of obsolete storage mediums, and that the move towards open-source file formats, even amongst large commercial enterprises such as Microsoft, provides greater documentation and uniformity.


A Brief Case Study: Accessing and Navigating Canada's Digital Collections, With an Emphasis on Tools for Born-Digital Sources

What can we do about this digital deluge? There are no simple answers, but historians must begin to conceptualize new additions to their traditional research and pedagogical toolkits. In the section that follows, I will do two things: introduce a case study, and several tools, both proprietary and those that are open and available off-the-shelf, to reveal one way forward.

For a case study, I have elected to use Canada's Digital Collections (CDC), a collection of "dead" websites removed from their original location on the World Wide Web and now digitally archived at LAC. In this section, I will introduce the collection, using digitally-archived sources from the early World Wide Web, and discuss my research methodology, before illustrating how we can derive meaningful data from massive arrays of unstructured data.37

This is a case study of a web archive. Several of these methodologies could be applied to other formats, including large arrays of textual material, whether born-digital or not. A distant reading of this type is not medium specific. It does, however, assume that the text is high quality. Digitized primary documents do not always have text layers, and if they are instituted computationally through Optical Character Recognition, errors can appear. Accuracy rates for digitized textual documents can range from as low as 40 percent for seventeenth- and eighteenth-century documents, to 80–90 percent for modern microfilmed newspapers, to over 99 percent for typeset, word-processed documents.38 That said, while historians today may be more interested in applying distant reading methodologies to more traditional primary source materials, it is my belief that in the not-so-distant future websites will be an increasingly significant source for social, cultural, and political historians.

What are Canada's Digital Collections? In 1996, Industry Canada, funded by the federal Youth Employment Strategy, facilitated the development of Canadian heritage websites by youth (defined as anyone between the ages of 15 and 30). It issued contracts to private corporations that agreed to use these individuals as web developers, for the twofold purpose of developing digital skills and instilling respect for, and engagement with, Canadian heritage and preservation.39 The collection, initially called SchoolNet but rebranded as Canada's Digital Collections in 1999, was accessible from the CDC homepage. After 2004, the program was shut down and the vast majority of the websites were archived on LAC's webpage. This collection is now the largest born-digital Internet archive that LAC holds, worthy of study as content, as well as for thinking about methodological approaches.

Conventional records about the program are unavailable in the traditional LAC collection, unsurprising in light of both its recent creation and backlogs in processing and accessioning archival materials. We can use one element of our born-digital toolkit, the Internet Archive's Wayback Machine (its archive of preserved early webpages), to get a sense of what this project looked like in its heyday. The main webpage for the CDC was www.collections.ic.gc.ca. However, a visitor today will see nothing but a collection of broken image links (two of them will bring you to the English and French-language versions of the LAC-hosted web archive, although this is only discernible through trial and error or through the site's source code).

With the Wayback Machine, we have a powerful, albeit limited, opportunity to "go back in time" and experience the website as previous web viewers would have (if a preserved website is available, of course). Among the over 420 million archived websites, we can search for and find the original CDC site. Between 12 October 1999 and the most recent snapshot, or web crawl (of the now-broken site), on 6 July 2011, the site was crawled 410 times.40 Unfortunately, every site version until the 15 October 2000 crawl has been somehow corrupted: content has been instead replaced by Chinese characters, which translate only into place names and other random words. Issues of early digital preservation, noted above, have reared their head in this collection.

From late 2000 onward, however, fully preserved CDC sites are accessible. Featured sites, new entries, success stories, awards, curricular units, and prominent invitations to apply for project funding populate the brightly coloured website. The excitement is palpable. John Manley, then Industry Minister, declared "it is crucial that Canada's young people get the skills and experience they need to participate in the knowledge-based economy of the future," and that CDC "shows that young people facing significant employment challenges can contribute to and learn from the resources available on the Information Highway."41 The documentation, reflecting the lower level of technical expertise and scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as well as preservation issues, hardware requirements for computers (a computer with an Intel 386 processor, 8MB RAM, a 1GB hard drive, and a 28.8 kbps modem was recommended), and training information for employers, educators, and participants.42 Sample proposals, community liaison ideas, amidst other information, set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites offers an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how viewers discovered the site and whether they think it is a good educational resource (they must have been relieved that the majority thought so – in 2003, some 84 percent felt that it was a "good" resource, the best of its type).43 By 2004, the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, alphabetical listing, success stories, and detailed copyright information.

The Canada's Digital Collections site began to move towards removal in November 2006, and was initially completely gone. A visitor to the site would learn that the content "is no longer available," and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007, a notice appeared with a declaration that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained, then and today for that matter, is a straightforward alphabetical list and a disclaimer that noted: "[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost."

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use to navigate large amounts of data. First, I create my own tools through computer programming; this work is showcased in the following section. I program in Wolfram Research's Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base, which complements the exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website in perfectly readable plaintext (avoiding Unicode errors), as follows: input = Import["http://activehistory.ca", "Plaintext"], and a second can make it all lower case for textual analysis, as such: lowerpage = ToLowerCase[input]. Variables are a critical building block: in the former example, "input" now holds the entire plaintext of the website ActiveHistory.ca, and "lowerpage" then holds it in lower case. Once we have lower-cased information, we are quickly able to extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
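To give a sense of how little scaffolding is involved, those two lines can be extended into a crude frequency counter. What follows is a minimal sketch rather than the finding aid program itself; the URL is the same illustrative one, and the cut-off of twenty words is arbitrary:

    (* Import a page as readable plain text and case-fold it, as above *)
    input = Import["http://activehistory.ca", "Plaintext"];
    lowerpage = ToLowerCase[input];

    (* Split on runs of non-word characters, tally the words, and rank by count *)
    words = StringSplit[lowerpage, Except[WordCharacter] ..];
    frequencies = SortBy[Tally[words], -Last[#] &];

    (* Show the twenty most frequent words with their counts *)
    TableForm[Take[frequencies, 20]]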

Even more promising is Mathematica's integration with the Wolfram|Alpha database (itself accessible at wolframalpha.com), a computational knowledge engine that promises to put incredible amounts of information at a user's fingertips. With a quick command, one can access the temperature at Toronto's Lester B. Pearson International Airport at 5:00 pm on 18 December 1983, the relative popularity of names such as "Edith" or "Emma," the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information, or find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.
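Within a notebook, such lookups are single calls to the built-in WolframAlpha function. The query strings below are illustrative only (and require a network connection); they mirror the examples just given:

    (* Ask Wolfram|Alpha for a historical statistic, returned as a formatted report *)
    WolframAlpha["Canadian unemployment rate March 1985"]

    (* Request just the short answer rather than the full report *)
    WolframAlpha["Canada GDP December 1995", "Result"]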

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be-all and end-all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools.47 It is a powerful suite of textual analysis tools. Uploading one's own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees a "word cloud" of a corpus, can click on any word to see its relative rise and fall over time (word trends), can access the original text, and can click on any word to see its contextual placement within individual sentences. It ably facilitates the "distant" reading of any reasonably-sized body of textual information, and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis tool rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. There are many visualization options provided: you can plot relationships between data, compare values (i.e., bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or – in a best-case scenario – hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" that appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era, and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.
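For readers who want the intuition behind Dunning's statistic: it asks whether a word's observed counts in two corpora diverge from what a shared underlying frequency would predict. In its standard form (following Dunning's 1993 formulation), the score for a given word is

    G² = 2 Σᵢ Oᵢ ln(Oᵢ / Eᵢ)

where Oᵢ is the observed count of the word in corpus i and Eᵢ is its expected count given the pooled frequency across both corpora. Larger scores flag words disproportionately characteristic of one corpus (for instance, of 1968 lyrics measured against the postwar whole).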

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2, or on my own blog at ianmilligan.ca. I have elected to leave the full commands online rather than provide them in this article: they require some technical knowledge, and working directly from the web is quicker than from the page. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian, dealing with conventional sources, often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command-line entry brought the entire archive onto my system; it took days to process the information, and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations: the information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, with 360 MB of plaintext with HTML encoding, reduced to a "mere" 50MB if all encoding is removed. A 50MB plain-text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with the material. Before you do this, though, you need to be familiar with the website's "terms of service," the legal documentation that outlines how users are allowed to use a website, resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced in part or in whole and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so: the index is located at www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/ABbenevolat, the next in …cdc/abnature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system, in this case one that noted that you had seen the disclaimer about the archival nature of the website), and then, with a one-line command, to download the entire database.52 Due to the imposition of a random wait and bandwidth limit, which meant that I would download one file every half to one and a half seconds, at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.
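For orientation, an invocation of the general shape described here (saved cookies, recursion, a one-second random wait, and a 30 kilobyte-per-second cap) would look roughly like the following. This is a sketch, not the exact command used for the CDC crawl, which is documented in the Programming Historian 2 piece:

    # First visit saves the disclaimer cookie to disk
    wget --save-cookies=cookies.txt --keep-session-cookies http://www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp

    # Recursive download of the collection, politely throttled
    wget --load-cookies=cookies.txt --recursive --no-parent --wait=1 --random-wait --limit-rate=30k http://www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/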

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these connections must often be inferred. Online, HTML born-digital sources are different in that the link offers a very concrete file.

Visualizing the CDC Collection as a Series of Hyperlinks

One quick visualization, the circular cluster seen above, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As the figure shows, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The site is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and relationships between different sources. At a glimpse, we can see suggestions to follow other sites, and determine if any nodes are shared amongst the various websites revealed.
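A minimal sketch of that link-scraping step, using Mathematica's built-in "Hyperlinks" import element, follows. The seed URL is illustrative, the link list is truncated to keep the example quick, and a full crawl would add the two-link depth limit and the interactivity described above:

    (* Scrape outbound links from a seed page, keeping at most twenty for speed *)
    seed = "http://www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp";
    links = Import[seed, "Hyperlinks"];
    links = Take[links, Min[20, Length[links]]];

    (* Follow each link once more, collecting the next layer as directed edges *)
    edges = DeleteDuplicates[Flatten[Map[Function[url,
        Map[url -> # &, Quiet[Check[Import[url, "Hyperlinks"], {}]]]], links]]];

    (* Draw the hyperlink web around the seed *)
    Graph[Join[Map[seed -> # &, links], edges], VertexLabels -> "Name"]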

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian who is primarily interested in the written word, my first step was to isolate the plaintext throughout the material so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels – first, the aggregate collection, and second, dividing the collection into each "file" or individual website – deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes, and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words appear within the nearly eight million words of the collection? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader, and it is of minimal visual attractiveness. Two quick decisions were made to enhance data usability: first, the information was "case-folded," all sent to lower-case; this enables us to count the total appearances of a word such as "digital" regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency (as opposed to phrase frequency), "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is comprised entirely of stop words), but they will quickly dominate any word-level visualization or list.53 Other words may need to be added to or removed from stop lists depending on the corpus: for example, if the phrase "page number" appeared on every page, that would be useless information for the researcher.
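In Mathematica, this word-frequency step is compact. The following is a minimal sketch, assuming the plain text of one website has been saved to a file; the file name and the abbreviated stop list are illustrative only:

    text = Import["cdc-site-267.txt", "Plaintext"];
    words = ToLowerCase /@ StringSplit[text, RegularExpression["\\W+"]];  (* case-folding *)
    stop = {"the", "and", "to", "of", "a", "in", "is"};                   (* abbreviated stop list *)
    freqs = Reverse[SortBy[Tally[DeleteCases[words, Alternatives @@ stop]], Last]];
    Take[freqs, 25]  (* the 25 most frequent substantive words *)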

The first step, then, is to use Mathematica to format word frequency information into Wordle.net-readable format (if we put all of the text into that website, it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds.


A Word Cloud of All Words in the CDC Corpus

What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that the collection deals with Canada and Canadians, that it holds documents in both French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, a word cloud leaves a severe amount of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or to educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but they need to be used with caution. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present." Despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters (such as 'bi'), two syllables, or, pertinent in our case, two words; an example is "canada is." A trigram is three words, a quadgram four words, a fivegram five words, and so on.
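Extracting word-level n-grams is similarly brief. The following sketch reuses the case-folded words list from the earlier snippet: Partition with an offset of one produces the overlapping word sequences, which can then be tallied and ranked like single words:

    ngrams[w_List, n_Integer] := Partition[w, n, 1];           (* overlapping n-grams *)
    trigrams = Reverse[SortBy[Tally[ngrams[words, 3]], Last]]; (* count and rank them *)
    Take[trigrams, 10]                                         (* the ten most frequent trigrams *)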


Word Cloud of the File "Past and Present"

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

{"alberta", 758}, {"listen", 490}, {"settlers", 467}, {"new", 454}, {"canada", 388}, {"land", 362}, {"read", 328}, {"people", 295}, {"settlement", 276}, {"chinese", 268}, {"canadian", 264}, {"place", 255}, {"heritage", 254}, {"edmonton", 250}, {"years", 238}, {"west", 236}, {"raymond", 218}, {"ukrainian", 216}, {"came", 213}, {"calgary", 209}, {"names", 206}, {"ranch", 204}, {"updated", 193}, {"history", 190}, {"communities", 183} …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example in this automatically generated output of top phrases:

{{"first", "people", "settlers"}, 100}, {{"the", "united", "states"}, 72}, {{"one", "of", "the"}, 59}, {{"of", "the", "west"}, 58}, {{"settlers", "last", "updated"}, 56}, {{"people", "settlers", "last"}, 56}, {{"albertans", "first", "people"}, 56}, {{"adventurous", "albertans", "first"}, 56}, {{"the", "bar", "u"}, 55}, {{"opening", "of", "the"}, 54}, {{"communities", "adventurous", "albertans"}, 54}, {{"new", "communities", "adventurous"}, 54}, {{"bar", "u", "ranch"}, 53}, {{"listen", "to", "the"}, 52} …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to home in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or, second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work, on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stop-word" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

{{"experience", "in", "the"}, 919}, {{"transferred", "to", "the"}, 918}, {{"funded", "by", "the"}, 918}, {{"of", "the", "provincial"}, 915}, {{"this", "web", "site"}, 915}, {{"409", "registry", "2"}, 914}, {{"are", "no", "longer"}, 914}, {{"gouvernement", "du", "canada"}, 910}, {{"use", "this", "link"}, 887}, {{"page", "or", "you"}, 887}, {{"following", "page", "or"}, 887}, {{"automatically", "transferred", "to"}, 887}, {{"be", "automatically", "transferred"}, 887}, {{"will", "be", "automatically"}, 887}, {{"industry", "you", "will"}, 887}, {{"multimedia", "industry", "you"}, 887}, {{"practical", "experience", "in"}, 887}, {{"gained", "practical", "experience"}, 887}, {{"they", "gained", "practical"}, 887}, {{"while", "they", "gained"}, 887}, {{"canadians", "while", "they"}, 887}, {{"young", "canadians", "while"}, 887}, {{"showcase", "of", "work"}, 887}, {{"excellent", "showcase", "of"}, 887} …

Here we see an array of common language: funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we move into the unique or very rare phrases that appear only once, twice, or three times.
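Both counting approaches described above, aggregate and relative, are straightforward to express in Mathematica. In this sketch, sites is an assumed list of per-site word lists; the sample values in the comments are the examples used earlier, not computed results:

    aggregate[term_String] := Total[Count[#, term] & /@ sites];                     (* total appearances *)
    relative[term_String] := Count[sites, s_ /; MemberQ[s, term]]/Length[sites];    (* share of sites *)
    aggregate["aboriginal"]    (* e.g., 457 *)
    N[relative["aboriginal"]]  (* e.g., 45/384, or about 0.117 *)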

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a computer science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or MALLET itself, via a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable, and they come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line: in essence, creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works: it is a proprietary black box. While the original algorithm is available, the actual methods that go into determining the ranking of search results are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above, we see a visualization of a given term: in this case, our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
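A rudimentary version of this search can be written directly against the frequency data. In this sketch, siteFreqs is an assumed list of {filename, counts} pairs, where counts is itself a list of {word, count} entries; the search term is illustrative:

    occurrences[term_String] :=
      Select[{First[#], Total[Cases[Last[#], {term, n_} :> n]]} & /@ siteFreqs,
        Last[#] > 0 &];
    occurrences["freedom"]                     (* pairs of file name and hit count *)
    BarChart[Last /@ occurrences["freedom"]]   (* hits charted across the collection *)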

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.


A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, and subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly identify critical sources.
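The keyword-in-context routine itself reduces to a single regular-expression search: show every occurrence of a term with up to forty characters of context on either side. A minimal sketch, with an illustrative file name:

    kwic[file_String, term_String] :=
      StringCases[Import[file, "Plaintext"],
        RegularExpression[".{0,40}" <> term <> ".{0,40}"], IgnoreCase -> True];
    kwic["blackgold.txt", "ontario"] // TableForm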

Another model through which we can visualize documents is a viewer primarily focusing on term frequency and inverse document frequency, both discussed below. To prepare this material, we combine our earlier data about word frequency within one site (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e., that the word "experience" appears 1,766 times overall). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, through which we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf multiplied by the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."
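The weighting itself is one line of Mathematica. In this sketch, tf is a term's count within one site, df the number of sites in which it appears, and nDocs the number of sites in the collection; the sample values are illustrative rather than drawn from the CDC data:

    tfidf[tf_, df_, nDocs_] := tf*Log[nDocs/df];
    N[tfidf[31, 12, 384]]  (* a term appearing 31 times in one site, and in 12 of 384 sites *)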


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper, yet still distant, reading of individual websites as well, especially the suites of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance, rather than processed on the spot, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need arises, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential of working with born-digital sources. It is intended to begin, or jumpstart, a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. They should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources, and of attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software and databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital repositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of the various ways to openly disseminate material on the web, whether through blogging, micro-blogging, or publishing webpages; it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history, and the readiness to work with born-digital sources, cannot stem from pedagogy alone, however; institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary and, second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996: Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph by Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and the structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.genealogy.umontreal.ca/fr/leprdh.htm.


14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients – 2011," Digging into Data webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to teach the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point; for the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age (Google eBook: Penguin, 2010). For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, a Theory, a Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online: www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can, of course, vary depending on the compression methods chosen. Their math was based on an average of 8MB per book, if it was scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82, published online ahead of print 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Center for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Center for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive: www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians, but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online: www.wayback.archive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," news release from Industry Canada, Internet Archive, 15 November 1996, available online: www.web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collections webpage, 24 November 2000, through Internet Archive: www.web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections homepage," 21 November 2003, through Internet Archive: www.web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important news about Canada's Digital Collections," Canada's Digital Collections webpage, 18 November 2006, through Internet Archive: www.web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 pm on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and now "Emma" is the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.


48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 "Important Notices," Library and Archives Canada website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. These can usually be found on index or splash pages, under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8, available online at www.nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated late-2007 MacBook Pro.

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.



Mining the 'Internet Graveyard': Rethinking the Historians' Toolkit

IAN MILLIGAN

Abstract

"Mining the 'Internet Graveyard'" argues that the advent of a massive quantity of born-digital historical sources necessitates a rethinking of the historians' toolkit. The contours of a third wave of computational history are outlined: a trend marked by ever-increasing amounts of digitized information (especially web-based), falling digital storage costs, a move to the cloud, and a corresponding increase in computational power to process these sources. Following this, the article uses a case study of an early born-digital archive at Library and Archives Canada, Canada's Digital Collections project (CDC), to bring some of these problems into view. An array of off-the-shelf data analysis solutions, coupled with code written in Mathematica, helps us bring context and retrieve information from a digital collection on a previously inaccessible scale. The article concludes with an illustration of the various computational tools available, as well as a call for greater digital literacy in history curricula and professional development.

Résumé

Dans cet article, l'auteur soutient que la production d'une grande quantité de sources historiques numériques nécessite une réévaluation du coffre à outils des historiennes et des historiens. Une troisième vague d'histoire informatique, marquée par un nombre toujours croissant d'informations numérisées (surtout dans le cadre d'Internet), la chute des coûts d'entreposage des données numériques, le développement des nuages informatiques et l'augmentation parallèle de la capacité d'utiliser ces sources, bouleverse déjà la pratique historienne. Cet article se veut une étude de cas basée sur ces observations. Il étudie plus particulièrement un projet de numérisation de Bibliothèque et Archives Canada, les Collections numérisées du Canada, pour éclairer certains défis ci-haut mentionnés. Un ensemble de solutions prêtes à utiliser pour l'analyse de données, jumelé avec un code informatique écrit en Mathematica, peut contribuer à retracer le contexte et à retrouver des informations à partir d'une collection numérique précédemment inaccessible aux chercheurs étant donné sa taille. L'article se termine par une présentation des différents outils informatiques accessibles aux historiens ainsi que par un appel à l'acquisition d'une plus grande culture numérique dans les curriculums en histoire et dans le développement professionnel des historiens.

My sincerest thanks to William Turkel, who sent me down the programming rabbit hole. Thanks also to Jennifer Bleakney, Thomas Peace, and the anonymous peer reviewers for their suggestions and assistance. Financial support was generously provided by the Social Sciences and Humanities Research Council of Canada.

The information produced and consumed by humankind used to vanish — that was the norm, the default. The sights, the sounds, the songs, the spoken word just melted away. Marks on stone, parchment, and paper were the special case. It did not occur to Sophocles' audiences that it would be sad for his plays to be lost; they enjoyed the show. Now expectations have inverted. Everything may be recorded and preserved, at least potentially.
James Gleick, The Information: A History, a Theory, a Flood, 396–7

Almost every day, almost every Canadian generates born-digital information that is fast becoming a sea of data. Historians must learn how to navigate this sea of digital material. Where will the historian of the future go to research today's history? What types of sources will they use? We are currently experiencing a revolutionary medium shift: as the price of digital storage plummets, and as communications are increasingly carried through digitized text, pictures, and video, the primary sources left for future historians will require a significantly expanded and rethought toolkit. The need to manage a plethora of born-digital sources (those originally digitally created) as an essential part of history will be upon us soon. While there is no commonly accepted rule-of-thumb for when a topic becomes "history," it is worth noting, as an example, that it took less than 30 years after the tumultuous year of 1968 for a varied, developed, and contentious North American historiography to appear on the topic of life in the 1960s.1 In 2021, it will have been a similar 30 years since, in August 1991, Tim Berners-Lee, a fellow at the European Organization for Nuclear Research (CERN), published the very first website and launched the World Wide Web. Professional historians need to be ready to add and develop new skills to deal with sources born in a digital age.

Digital sources necessitate a rethinking of the historian's toolkit. Basic training for new historians requires familiarity with new methodological tools, and resources for the acquisition of these tools must be made available to their mentors and teachers. Not only will these tools help in dealing with more recent born-digital sources, but they will also help historians capitalize on digitized archival sources from our more-distant past. This form of history will not replace earlier methodologies, but will instead play a meaningful, collaborative, and supportive role. It will be a form of distant reading, encompassing thousands or tens of thousands of sources, that will complement the more traditional and critically important close reading of small batches of sources that characterizes so much of our work. Indeed, much of what we think of as "digital history" may simply be "history" in the years to come.2

There are a number of avenues along which this transformation will take place. First, it is important to establish a map that situates this emerging field within what has come before. Second, I examine the current situation of born-digital sources and historical practice. Third, I use an early born-digital archive from Library and Archives Canada (LAC), Canada's Digital Collections project (CDC), to evaluate how we access and make sense of digital information today. I then apply emergent visualization techniques to the sources in this archive to show one way through which historians can tackle born-digital source issues. Lastly, I lay out a road map to ensure that the next generation has enough basic digital literacy to adequately analyze this new form of source material.

Imagine a future historian taking on a central question of social and cultural life in the early twenty-first century, such as how Canadians understood Idle No More or the Occupy movement through social media. What would her archive look like? I like to imagine it as boxes stretching out into the distance, tapering off without any immediate reference points to bring them into perspective. Many of these digital archives will be without human-generated finding aids, will have perhaps no dividers or headings within "files," and will certainly have no archivist with a comprehensive grasp of the collection's contents. A recent example is illustrative. During the IdleNoMore protests, Twitter witnessed an astounding 55,334 tweets on 11 January 2013. Each tweet can be up to 140 characters; to put that into perspective, this article is less than 900 lines long in a word processor. Yet this massive amount of information looks cryptic: strange usernames, no archives, no folders, not even an immediate way to see whether a given tweet is relevant to your research or not.3 This is a vast collection. Information overload.

This example is indicative of the issues facing the historical profession.4 I am not exaggerating: every day, half a billion tweets are sent,5 hundreds of thousands of comments are uploaded to news sites, and scores of blog posts are uploaded. This makes up one small aspect of broader data sets: automated logs, including climate readings, security access reports, library activity, search patterns, and books and movies ordered from Amazon.ca or Netflix.6 Even if only a small fraction of this is available to future historians, it will represent an unparalleled source of information about how life was lived (or thought to have been lived) during the early twenty-first century. To provide another, perhaps more tangible, example: while carrying out my previous work on youth cultures in 1960s English Canada, I was struck by how inaccessible television sources were compared to newspaper records, despite television being arguably the primary medium of the time. Unlike the 1960s, however, the Internet leaves a plethora of available (if difficult to access) primary sources for historians. In the digital age, archiving will be different, which means that the history we write will be different.

We must begin planning for this plethora of born-digital sources now. Using modern technology, we have the ability to quickly find needles in digital haystacks: to look inside a 'box' and immediately find that relevant letter, rather than spending hours searching and reading everything. Yet we also have the ability to pull our gaze back, distantly elucidating the context of our relevant documents as well.

The Necessity of Distant Reading: Historians and Computational History

Franco Moretti, in his groundbreaking 2005 work of literary criticism, Graphs, Maps, Trees, called on his colleagues to rethink their craft. With the advent of mass-produced novels in nineteenth-century Britain, Moretti argued, scholars had to change their disciplinary approach. Traditionally, literary scholars worked with a corpus of around 200 novels; though an impressive number for the average person, it represents only around one percent of the total output of that period. To work with all of this material in an effort to understand these novels as a genre, critics needed to approach research questions in a different way:

[C]lose reading won't help here, a novel a day every day of the year would take a century or so … And it's not even a matter of time, but of method: a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn't a sum of individual cases: it's a collective system, that should be grasped as such, as a whole.7

Instead of close reading, Moretti called for literary theorists to practice "distant reading."

For inspiration, Moretti drew on the work of French Annales historian Fernand Braudel. Braudel, amongst the most instrumental historians of the twentieth century, pioneered a similarly distant approach to history: his inquiries spanned large expanses of time and space, those of civilizations and the Mediterranean world, as well as smaller (only by comparison) regional histories of Italy and France. His distant approach to history did not necessitate disengagement from the human actors on the ground, seeing instead that beyond the narrow focus of individual events lay the constant ebb and flow of poverty and other endemic structural features.8 While his methodology does not apply itself well to rapidly changing societies (his focus was on long-term, slow change), his points around distant reading and the need for collaborative, interdisciplinary research are a useful antecedent. Similar opportunities present themselves when dealing with the vast array of born-digital sources (as well as digitized historical sources): we can distantly read large arrays of sources across a longue durée of time and space, mediated through algorithms and computer interfaces, opening up the ability to consider the voices of everybody preserved in the source.

Historians must be open to the digital turn, thanks to the astounding growth of digital sources and an increasing technical ability to process them on a mass scale. Social sciences and humanities researchers have begun to witness a profound transformation in how they conduct and disseminate their work. Indeed, historical digital sources have reached a scale where they defy conventional analysis and now require computational analysis. Responding to this, archives are increasingly committed to preserving cultural heritage materials in digital forms.

I would like to pre-emptively respond to two criticisms of this digital shift within the historical profession. First, historians will not all have to become programmers. Just as not all historians need a firm grasp of Geographical Information Systems (GIS), a developed understanding of the methodological implications of community-based oral history, the incorporation of a transnational perspective, or in-depth engagement with cutting-edge demographic models, not all historians have to approach their trade from a computational perspective. Nor should they. Digital history does not replace close reading, traditional archival inquiry, or going into communities (to use only a few examples) to uncover notions of collective memory or trauma. Indeed, digital historians will play a facilitative role and provide a broader reading context. Yet they will still be historians: collecting relevant sources, analyzing and contextualizing them, situating them in convincing narratives or explanatory frameworks, and disseminating their findings to wider audiences.

Neither will historians replace professional programmers. Instead, as with other subfields, historians using digital sources will need to be prepared to work on larger, truly interdisciplinary teams. Computer scientists, statisticians, and mathematicians, for example, have critical methodological and analytical contributions to make towards questions of historical inquiry. Historians need to forge further links with these disciplines. Such meaningful collaboration, however, will require historians who can speak their languages and who have an understanding of their work. Some of us must be able to discuss computational or algorithmic methodologies when we collectively explore the past.

The methodologies discussed in this paper fall into the twin fields of data mining and textual analysis. The former refers to the discovery of patterns or information about large amounts of information, while the latter refers to the use of digital tools to analyze large quantities of text.9 The case study here draws upon three main concepts: information retrieval (quickly finding what you want), information and term frequency (how often various concepts and terms appear), and topic modeling (finding frequently occurring bags of words throughout documents). Other projects have involved authorship attribution, which is finding out the author of documents based on previous patterns, style analysis, or creating network graphs of topics, groups, documents, or individuals.

The Culturomics project, run by the Cultural Observatory at Harvard University, is a notable model of collaboration on a big history project. As laid out in Science and put online as the Google Books n-gram viewer (www.books.google.com/ngrams), this project indexed word and phrase frequency across over five million books, enabling researchers to trace the rise and fall of cultural ideas and phenomena through targeted keyword and phrase searches.10 This is the most ambitious, and certainly the most advanced and accessible, Big History project yet launched. Yet amongst the extensive author list (13 individuals and the Google Books team) there were no historians. This suggests that other disciplines in the humanities and social sciences have embraced the digital turn far more than historians. However, as I mentioned, in the coming decades historians will increasingly rely on digital archives. There is a risk that history as a professional discipline could be left behind.

American Historical Association President Anthony Grafton noted this absence in a widely circulated Perspectives article. Why was there no historian, he asked, when this was both primarily a historical project and a team comprised of varied doctoral holders from disciplines such as English literature, psychology, computer science, biology, and mathematics?12 Culturomics project leaders Erez Lieberman Aiden and Jean-Baptiste Michel responded frankly. While the team approached historians, and some did play advisory roles, every author listed "directly contributed to either the creation of the collection of written texts (the 'corpus') or to the design and execution of the specific analyses we performed. No academic historians met this bar." They continued to indict the profession:

The historians who came to the meeting were intelligent, kind, and encouraging. But they didn't seem to have a good sense of how to wield quantitative data to answer questions, didn't have relevant computational skills, and didn't seem to have the time to dedicate to a big multi-author collaboration. It's not their fault: these things don't appear to be taught or encouraged in history departments right now.13

This is a serious indictment considering the pending archival shift to digital sources, and one that still largely holds true (a few history departments now offer digital history courses, but they are still outliers). This form of history, the massive analysis of digitized information, will continue. Furthermore, these approaches will attract public attention, as they lend themselves well to accessible online deployment. Unless historians are well positioned, they risk ceding professional ground to other disciplines.

While historians were not involved in Culturomics, they are, however, involved in several other digital projects. Two major and ambitious long-term Canadian projects are currently using computational methods to reconstruct the demographic past. The first is the Canadian Century Research Infrastructure project, which is creating databases from five censuses and aiming to provide "a new foundation for the study of social, economic, cultural, and political change," while the second is the Programme de recherche en démographie historique, hosted by the Université de Montréal.13 They are reconstructing the European population of Quebec in the seventeenth and eighteenth centuries, drawing heavily on parish registers. Furthermore, several projects in England have harnessed the wave of interest in "big data" to tell compelling historical stories about their pasts: the Criminal Intent project, for example, used the proceedings of the Old Bailey courts (the main criminal court in Greater London) to visualize large quantities of historical information.14 The Criminal Intent project was itself funded by the Digging into Data Challenge, which brings together multinational research teams and funding from federal-level funding agencies in Canada, the United States, the United Kingdom, and beyond. Several other historical studies have been funded under their auspices, from studies of commodity trade in the eighteenth- and nineteenth-century Atlantic world (involving scholars at the University of Edinburgh, the University of St Andrews, and York University) to data mining the degrees of economic opportunity and spatial mobility in Britain, Canada, and the United States (bringing researchers at the University of Guelph, the University of Leicester, and the University of Minnesota together).15

These exciting interdisciplinary teams may be heralding a disciplinary shift, although it is still early days.

Collaborative teams have limitations, though. We will have to be prepared to tackle digital sources alone, on shoestring budgets. Historians will not always be able to call on external programmers or specialists to do this work, because teams often develop out of funding grants. As governmental and university budgets often prove unable to meet consistent research needs, baseline technical knowledge is required for independent and unfunded work. If historians rely exclusively on outside help to deal with born-digital sources, the ability to do computational history diminishes. As the success of the Programming Historian indicates, and as I will demonstrate in my case study below, historians can learn basic techniques.16 Let us tweak our own algorithms.

A social or cultural history of the very late-twentieth or twenty-first century will have to account for born-digital sources, and historians will (in conjunction with archival professionals) have to curate and make sense of the data. Historians need to be able to engage with these sources themselves, or at least as part of a team, rather than having sources mediated through the lens of outside disciplines or commercial interests. Algorithms may be one meaningful way to make sense of cultural trends or extract meaningful information from billions of tweets, but we must ensure professional historical input goes into their crafting.

The Third Wave of Computational History

Historians have fruitfully used computers in the past, as part of two previous waves. The 1960s and 1970s saw computational tools used for demographic, population, and economic histories, a trend that in Canada saw full realization in comprehensive studies such as Michael Katz's The People of Hamilton, Canada West.17 As Ian Anderson has convincingly noted in a review of history and computing, studies of this sort saw the field become associated with quantitative studies.18

After a retreat from the spotlight, computational history rose again in the 1990s with the advent of the personal computer, graphical user interfaces, and overall improvements in ease of use. Punch cards could now give way to database programs, GIS, and early online networks such as H-Net and Usenet. We are now, I believe, on the cusp of a third revolution in computational history thanks to three main factors: decreasing storage costs, the power of the Internet and distributed cloud computing, and the rise of professionals dealing with both digital preservation and open-source tools.

Decreasing storage costs have led to massive amounts of information being preserved, as Gleick noted in this article's epigraph. It is important to note that this information overload is not new: people have long worried about the impact of too much information.19 In the sixteenth century, responding to the rise of the printing press, the German priest Martin Luther decried that the "multitude of books [were] a great evil." In the nineteenth century, Edgar Allan Poe bemoaned that "[t]he enormous multiplication of books in every branch of knowledge is one of the greatest evils of this age." As recently as 1970, American historian Lewis Mumford lamented that "the overproduction of books will bring about a state of intellectual enervation and depletion hardly to be distinguished from massive ignorance."20 The rise of born-digital sources must thus be seen in this continuous context of handwringing around the expansion and rise of information.

Useful historical information is being preserved at a rate that is accelerating with every passing day. One useful metric to compare dataset size is that of the American Library of Congress (LOC); indeed, a "LOC" has become shorthand for a unit of measurement in the digital humanities. Measuring the actual extent of the printed collection in terms of data is notoriously difficult: James Gleick has claimed it to be around 10TB,22 whereas the Library of Congress itself claims a more accurate figure is about 200TB.23 While these two figures are obviously divergent, neither is on an unimaginable scale — personal computers now often ship with a terabyte or more of storage. When Claude Shannon, the father of information theory, carried out a thought experiment in 1949 about items or places that might store "information" (an emerging field which viewed messages in their aggregate), he scaled sites from a digit upwards to the largest collection he could imagine. The list scaled up logarithmically from 10^0 to 10^13 and beyond: if a single-spaced page of typing is 10^4, 10^7 the Proceedings of the Institute of Radio Engineers, and 10^9 the entire text of the Encyclopaedia Britannica, then 10^14 was "the largest information stockpile he could think of: the Library of Congress."24

But now the Library of Congress' print collection — something that has taken two centuries to gather — is dwarfed by their newest collection of archived born-digital sources, the vast majority only a matter of years old compared to the much wider date-range of non-digital sources in the traditional collection. The LOC has begun collecting its own born-digital Internet archive: multiple snapshots are taken of webpages to show how they change over time. Even as a selective, curated collection, drawing on governmental, political, educational, and creative websites, the LOC has already collected 254TB of data, and adds 5TB a month.25 The Internet Archive, through its Wayback Machine, is even more ambitious: it seeks to archive every website on the Internet. While its size is also hard to put an exact finger on, as of late 2012 it had over 10 petabytes of information (or, to put it into perspective, a little over 50 Library of Congresses if we take the 200TB figure).26


This is, to some degree, comparing apples and oranges: the Library of Congress is predominantly print, whereas the Internet Archive has considerable multimedia holdings, which include videos, images, and audio files. The LOC is furthermore a more curated collection, whereas the Internet Archive draws from a wider range of producers. The Internet offers the advantages and disadvantages of being a more democratic archive. For example, a GeoCities site (an early way for anybody to create their own website) created by a 12-year-old Canadian in the mid-1990s might be preserved by the Internet Archive, whereas it almost certainly would not be saved by a national archive. This difference in size and collections management, however, is at the root of the changing historians' toolkit. If we make use of these large data troves, we can access a new and broad range of historical subjects.

Conventional sources are also part of this deluge. Google Books has been digitizing the printed word, and it currently has 15 million works completed; by 2020, it audaciously aims to digitize every book ever written. While copyright issues loom, the project will allow researchers to access aggregate word and phrase data, as seen in the Google n-gram project.27 In Canada, Library and Archives Canada (LAC) has a more circumscribed digitization project, collecting "a representative sample of Canadian websites," focusing particularly on Government of Canada webpages, as well as a curated approach preserving case studies such as the 2006 federal election. Currently it has 4TB of federal government information and 7TB of total information in its larger web archive (a figure which, as I note below, is unlikely to drastically improve despite ostensible goals of archival 'modernization').28

In Canada, LAC has — on the surface — been advancing an ambitious digitization program, as it realizes the challenges now facing archivists and historians. Daniel J. Caron, the Librarian and Archivist of Canada, has been outspoken on this front, with several public addresses on the question of digital sources and archives. Throughout 2012, primary source digitization at LAC preoccupied critics who saw it as a means to depreciate on-site access. The Canadian Association of University Teachers, the Canadian Historical Association, as well as other professional organizations, mounted campaigns against this shift in resources.29 LAC does have a point, however, with its modernization agenda. The state of Canada's online collections is poor: they are small and sorely lacking when compared to the on-site collections, and LAC does need to modernize and achieve the laudable goal of reaching audiences beyond Ottawa.

Nonetheless, digitization has unfortunately been used as a smokescreen to depreciate overall service offerings.30 The modernization agenda would also see 50 percent of the digitization staff cut.31 LAC's online collections are small; it does not work with online developers through Application Programming Interfaces (APIs, a way for one computer program to talk to another and speed up research — Canadiana.ca is, incidentally, a leader in this area); and there have been no promising indications that this state of affairs will change. Digitization has not been comprehensive. Online preservation is important, and historians must fight for both on-site and on-line access.

A number of historians have recognized this challenge and are playing an instrumental role in preserving the born-digital past. The Roy Rosenzweig Center for History and New Media at George Mason University launched several high-profile preservation projects: the September 11th Digital Archive, the Hurricane Digital Memory Bank (focusing on Hurricanes Katrina and Rita), and, as of writing, the Occupy archive.32 Massive amounts of online content are curated and preserved: photographs, news reports, blog posts, and now tweets. These complement more traditional efforts of collecting and preserving oral histories and personal recollections, which are then geo-tagged (tied to a particular geographical location), transcribed, and placed online. These archives serve a dual purpose: preserving the past for future generations while also facilitating easy dissemination for today's audiences.

Digital preservation, however, is a topic of critical importance. Much of our early digital heritage has been lost. Many websites, especially those predating the Internet Archive's 1996 web archiving project launch, are completely gone. Early storage media (from tape to early floppy disks) have become obsolete and are now nearly inaccessible. Compounding this, software development has seen early proprietary file formats depreciated, ignored, and eventually made wholly inaccessible.33 Fortunately, the problem of digital preservation was recognized by the 1990s, and in 2000 the American government established the National Digital Information Infrastructure and Preservation Program.34 As early as 1998, digital preservationists argued: "[h]istorians will look back on this era … and see a period of very little information. A 'digital gap' will span from the beginning of the wide-spread use of the computer until the time we eventually solve this problem. What we're all trying to do is to shorten that gap." To bring their successes home, that quotation is itself taken from an Internet Archive snapshot of Wired magazine's website.35

Indeed, greater attention towards rigorous metadata (which is, briefly put, data about data, often preserved in plain text format), open-access file formats documented by the International Organization for Standardization (ISO), and dedicated digital preservation librarians and archivists gives us hope for the future.36 It is my belief that cloud storage, the online hosting of data in large third-party offsite centres, mitigates the issue of obsolete storage media. Furthermore, the move towards open-source file formats, even amongst large commercial enterprises such as Microsoft, provides greater documentation and uniformity.


A Brief Case Study: Accessing and Navigating Canada's Digital Collections, With an Emphasis on Tools for Born-Digital Sources

What can we do about this digital deluge? There are no simple answers, but historians must begin to conceptualize new additions to their traditional research and pedagogical toolkits. In the section that follows, I will do two things: introduce a case study, and several tools — both proprietary and those that are open and available off-the-shelf — to reveal one way forward.

For a case study, I have elected to use Canada's Digital Collections (CDC), a collection of 'dead' websites removed from their original location on the World Wide Web and now digitally archived at LAC. In this section, I will introduce the collection, using digitally-archived sources from the early World Wide Web, and discuss my research methodology, before illustrating how we can derive meaningful data from massive arrays of unstructured data.37

This is a case study of a web archive. Several of these methodologies could be applied to other formats, including large arrays of textual material, whether born-digital or not: a distant reading of this type is not medium specific. It does, however, assume that the text is high quality. Digitized primary documents do not always have text layers, and if they are instituted computationally through Optical Character Recognition, errors can appear. Accuracy rates for digitized textual documents can range from as low as 40 percent for seventeenth- and eighteenth-century documents, to 80–90 percent for modern microfilmed newspapers, to over 99 percent for typeset, word-processed documents.38 That said, while historians today may be more interested in applying distant reading methodologies to more traditional primary source materials, it is my belief that in the not-so-distant future websites will be an increasingly significant source for social, cultural, and political historians.

What are Canada's Digital Collections? In 1996, Industry Canada, funded by the federal Youth Employment Strategy, facilitated the development of Canadian heritage websites by youth (defined as anyone between the ages of 15 and 30). It issued contracts to private corporations that agreed to use these individuals as web developers, for the twofold purpose of developing digital skills and instilling respect for, and engagement with, Canadian heritage and preservation.39 The collection, initially called SchoolNet but rebranded as Canada's Digital Collections in 1999, was accessible from the CDC homepage. After 2004, the program was shut down and the vast majority of the websites were archived on LAC's webpage. This collection is now the largest born-digital Internet archive that LAC holds, worthy of study as content as well as for thinking about methodological approaches.

Conventional records about the program are unavailable in the traditional LAC collection, unsurprising in light of both its recent creation and backlogs in processing and accessioning archival materials. We can use one element of our born-digital toolkit, the Internet Archive's Wayback Machine (its archive of preserved early webpages), to get a sense of what this project looked like in its heyday. The main webpage for the CDC was www.collections.ic.gc.ca. However, a visitor today will see nothing but a collection of broken image links (two of them will bring you to the English and French-language versions of the LAC-hosted web archive, although this is only discernible through trial and error or through the site's source code).

With the Wayback Machine, we have a powerful, albeit limited, opportunity to "go back in time" and experience the website as previous web viewers would have (if a preserved website is available, of course). Among the over 420 million archived websites, we can search for and find the original CDC site. Between 12 October 1999 and the most recent snapshot, or web crawl, of the now-broken site on 6 July 2011, the site was crawled 410 times.40 Unfortunately, every site version until the 15 October 2000 crawl has been somehow corrupted: content has instead been replaced by Chinese characters, which translate only into place names and other random words. The issues of early digital preservation noted above have reared their head in this collection.

From late 2000 onward, however, fully preserved CDC sites are accessible. Featured sites, new entries, success stories, awards, curricular units, and prominent invitations to apply for project funding populate the brightly coloured website. The excitement is palpable. John Manley, then Industry Minister, declared that "it is crucial that Canada's young people get the skills and experience they need to participate in the knowledge-based economy of the future" and that the CDC "shows that young people facing significant employment challenges can contribute to and learn from the resources available on the Information Highway."41 The documentation, reflecting the lower level of technical expertise and scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as well as preservation issues, hardware requirements for computers (a computer with an Intel 386 processor, 8MB of RAM, a 1GB hard drive, and a 28.8 kbps modem was recommended), and training information for employers, educators, and participants.42 Sample proposals, community liaison ideas, amidst other information, set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites offers an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how viewers discovered the site and whether they think it is a good educational resource (the administrators must have been relieved that the majority thought so — in 2003, some 84 percent felt that it was a "good" resource, the best of its type).43 By 2004, the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, alphabetical listing, success stories, and detailed copyright information.

The Canada's Digital Collections site began to move towards removal in November 2006, and was initially completely gone. A visitor to the site would learn that the content "is no longer available," and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007, a notice appeared with a declaration that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained then, and today for that matter, is a straightforward alphabetical list and a disclaimer that noted: "[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost."

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use to navigate large amounts of data. First, I create my own tools through computer programming; this work is showcased in the following section. I program in Wolfram Research's Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base that complements exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website in perfectly readable plaintext (avoiding unicode errors), as follows: input = Import["http://activehistory.ca", "Plaintext"], and a second can make it all lower-case for textual analysis, as such: lowerpage = ToLowerCase[input]. Variables are a critical building block: in the former example, "input" now holds the entire plaintext of the website ActiveHistory.ca, and "lowerpage" then holds it in lower case. Once we have lower-cased information, we are quickly able to extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
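To give a flavour of how these building blocks combine in practice, here is a minimal sketch of the word-frequency step just described; the tokenization pattern and any variable names beyond those quoted above are my own assumptions rather than the article's code:

    input = Import["http://activehistory.ca", "Plaintext"];
    lowerpage = ToLowerCase[input];
    (* split the text into words on any run of non-word characters *)
    words = StringSplit[lowerpage, Except[WordCharacter] ..];
    (* tally each distinct word, then sort from most to least frequent *)
    wordFreq = SortBy[Tally[words], -#[[2]] &];
    Take[wordFreq, 25]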

Even more promising is Mathematica's integration with the Wolfram|Alpha database (itself accessible at wolframalpha.com), a computational knowledge engine that promises to put incredible amounts of information at a user's fingertips. With a quick command, one can access the temperature at Toronto's Lester B. Pearson International Airport at 5:00 pm on 18 December 1983, the relative popularity of names such as "Edith" or "Emma," the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information, or find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.
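As a sketch of what such a query looks like from within Mathematica, the free-form query strings below are my own phrasings of the examples above, and the exact results returned will depend on Wolfram|Alpha's curated data:

    (* WolframAlpha[] sends a free-form query to the knowledge engine *)
    WolframAlpha["Canada unemployment rate March 1985"]
    WolframAlpha["Canada GDP December 1995"]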

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be-all and end-all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools.47 It is a powerful suite of textual analysis tools. Uploading one's own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a "word cloud" of a corpus; the ability to click on any word to see its relative rise and fall over time (word trends); access to the original text; and the ability to click on any word to see its contextual placement within individual sentences. It ably facilitates the "distance" reading of any reasonably-sized body of textual information and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. There are many visualization options provided: you can plot relationships between data, compare values (i.e., bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or — in a best-case scenario — hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" which appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2 or on my own blog at ianmilligan.ca. I have elected to leave the step-by-step commands online rather than provide them in this article: they require some technical knowledge, and working directly from the web is quicker than working from here. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian, dealing with conventional sources, often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command line entry brought the entire archive onto my system; it took days to process the information and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay, or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations: the information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB; 360 MB of plaintext with HTML encoding, reduced to a 'mere' 50MB if all encoding is removed. A 50MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with material. Before you do this, though, you need to be familiar with the website's "terms of service," the legal documentation that outlines how users are allowed to use a website, resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced in part or in whole and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so: the index is located at www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/ABbenevolat, the next in …cdc/abnature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system, in this case one that noted that you had seen the disclaimer about the archival nature of the website), and then used a one-line command in my command line to download the entire database.52 Due to the imposition of a random wait and bandwidth limit, which meant that I would download one file every half to one and a half seconds, at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.
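The exact one-line command is not reproduced in this article, but a plausible reconstruction, using wget's standard options to match the behaviour described above, would look something like this:

    # -r recurses through the collection; --no-parent keeps the crawl inside /cdc/
    # --wait=1 --random-wait pauses 0.5 to 1.5 seconds between files
    # --limit-rate=30k caps downloads at 30 kilobytes per second
    wget -r --no-parent --wait=1 --random-wait --limit-rate=30k \
         --load-cookies cookies.txt \
         http://www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/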

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these must often be inferred online. HTML born-digital sources are different in that the link offers a very concrete file.

Visualizing the CDC Collection as a Series of Hyperlinks

One quick visualization, the circular cluster seen above, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As we see in the figure, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The site is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and relationships between different sources. At a glimpse, we can see suggestions to follow other sites, and determine if any nodes are shared amongst the various websites revealed.
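A minimal sketch of how such a link graph can be assembled in Mathematica follows, assuming the collection has been mirrored to a local folder (the folder name is a placeholder; Import's "Hyperlinks" element returns the links found on a page):

    (* gather every archived page in the local mirror *)
    pages = FileNames[{"*.html", "*.htm"}, "cdc-mirror", Infinity];
    (* build directed edges from each page to the links it contains *)
    edges = Flatten[
       Table[page -> link, {page, pages}, {link, Import[page, "Hyperlinks"]}]];
    (* plot the hyperlink cluster *)
    GraphPlot[edges]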

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian who is primarily interested in the written word, my first step was to isolate the plaintext throughout the material, so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels — first, the aggregate collection, and second, dividing the collection into each 'file,' or individual website — deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes, and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words within the nearly eight million words of the collection appear? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader and is of minimal visual attractiveness. Two quick decisions were made to enhance data usability. First, the information was "case-folded," all sent to lower-case; this enables us to count the total appearances of a word such as "digital" regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency — as opposed to phrase frequency — "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is comprised entirely of stop words), but will quickly dominate any word-level visualization or list.53 Other words may need to be added or removed from stop lists depending on the corpus: for example, if the phrase "page number" appeared on every page, that would be useless information for the researcher.
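In Mathematica, filtering the token list from the earlier sketch against a stop list is a one-line operation; the abbreviated stop list here is purely illustrative:

    stopwords = {"the", "and", "to", "of", "a", "in", "is", "or", "for"};
    (* keep only the words that do not appear on the stop list *)
    filteredWords = Select[words, ! MemberQ[stopwords, #] &];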

The first step, then, is to use Mathematica to format word frequency information into a Wordle.net-readable format (if we put all of the text into that website, it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds.

A Word Cloud of All Words in the CDC Corpus

What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that the collection deals with Canada and Canadians, that it is a collection with documents in French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, they leave a severe amount of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but need to be used cautiously. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present": despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters (such as 'bi'), two syllables, or — pertinent in our case — two words: for example, "canada is." A trigram is three words, a quadgram is four words, a fivegram is five words, and so on.
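Word-level n-grams of this sort are straightforward to generate from the token list built earlier; the helper name below is my own (note that stop words are retained here, since they matter for phrases):

    (* overlapping n-word windows, tallied and sorted by frequency *)
    ngrams[tokens_List, n_Integer] :=
      SortBy[Tally[Partition[tokens, n, 1]], -#[[2]] &]
    bigrams = ngrams[words, 2];
    trigrams = ngrams[words, 3];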


Word Cloud of the File "Past and Present"

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

"alberta" 758, "listen" 490, "settlers" 467, "new" 454, "canada" 388, "land" 362, "read" 328, "people" 295, "settlement" 276, "chinese" 268, "canadian" 264, "place" 255, "heritage" 254, "edmonton" 250, "years" 238, "west" 236, "raymond" 218, "ukrainian" 216, "came" 213, "calgary" 209, "names" 206, "ranch" 204, "updated" 193, "history" 190, "communities" 183 …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example, in this automatically generated output of top phrases:

"first people settlers" 100, "the united states" 72, "one of the" 59, "of the west" 58, "settlers last updated" 56, "people settlers last" 56, "albertans first people" 56, "adventurous albertans first" 56, "the bar u" 55, "opening of the" 54, "communities adventurous albertans" 54, "new communities adventurous" 54, "bar u ranch" 53, "listen to the" 52 …

Through a few clicks, we can navigate and determine the basic contours — from a distance — of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to quickly hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stop-word" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

"experience in the" 919, "transferred to the" 918, "funded by the" 918, "of the provincial" 915, "this web site" 915, "409 registry 2" 914, "are no longer" 914, "gouvernement du canada" 910, "use this link" 887, "page or you" 887, "following page or" 887, "automatically transferred to" 887, "be automatically transferred" 887, "will be automatically" 887, "industry you will" 887, "multimedia industry you" 887, "practical experience in" 887, "gained practical experience" 887, "they gained practical" 887, "while they gained" 887, "canadians while they" 887, "young canadians while" 887, "showcase of work" 887, "excellent showcase of" 887 …

Here we see an array of common language: funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we can move into the unique or very rare phrases that appear only once, twice, or three times.

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.55

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable, and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.
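For readers who take the command-line route, a MALLET run takes roughly the following form; the directory names and the choice of twenty topics are my own placeholder values, while the commands follow MALLET's documented interface:

    # import a folder of plaintext files into MALLET's internal format
    bin/mallet import-dir --input cdc-plaintext/ --output cdc.mallet \
        --keep-sequence --remove-stopwords
    # train a topic model and write the top words for each topic to disk
    bin/mallet train-topics --input cdc.mallet --num-topics 20 \
        --output-topic-keys cdc_keys.txt --output-doc-topics cdc_topics.txt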


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works: it is a proprietary black box. While the original algorithm is available, the actual methods that go into determining the ranking of search terms are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for the occurrence of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above, we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
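A stripped-down sketch of such a chart in Mathematica follows; apart from the 31 occurrences in "blackgold.txt" discussed below, the file names and counts are placeholder values rather than real results:

    (* per-file occurrence counts for one search term; illustrative data *)
    fileCounts = {{"blackgold.txt", 31}, {"settlers.txt", 12}, {"acadia.txt", 4}};
    BarChart[
     Tooltip[#[[2]], #[[1]]] & /@ fileCounts, (* hover to see the file name *)
     ChartLabels -> Placed[fileCounts[[All, 1]], Below]]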

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.


A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly realize critical sources.
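Keyword-in-context itself is simple to sketch: for each occurrence of a term, print a window of surrounding words. The following minimal version is my own, not the article's routine:

    (* show w words of context on either side of each occurrence of term *)
    kwic[tokens_List, term_String, w_: 5] :=
      Map[StringJoin[Riffle[
          tokens[[Max[1, # - w] ;; Min[Length[tokens], # + w]]], " "]] &,
       Flatten[Position[tokens, term]]]
    Column[kwic[words, "ontario"]]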

Another model through which we can visualize documents is through a viewer primarily focusing on term document frequency and inverse document frequency, terms discussed below. To prepare this material, we are combining our earlier data about word frequency (that the word "experience" appears 919 times in one site, perhaps) with aggregate information about frequency in the entire corpus (i.e., the word "experience" appears 1,766 times). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf times the logarithm of the number of documents divided by the number of documents where the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."
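Expressed as code, the weighting is compact; the 919 figure is the "experience" count from the text, while the document counts in the usage line are placeholders:

    (* tf-idf: term frequency times the log of (total docs / docs with term) *)
    tfidf[tf_, docFreq_, nDocs_] := tf*Log[nDocs/docFreq]
    tfidf[919, 45, 384] // N  (* roughly 1970.3 *)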


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance, rather than processed on-site, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to a HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need rears itself, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin, or jumpstart, a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. Such a course should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before. How have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web: whether through blogging, micro-blogging, publishing webpages, and so forth, it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history and the readiness to work with born-digital sources cannot stem from pedagogy alone, however. Institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a “digital humanities” option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary and, second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is the founder and co-editor of the website ActiveHistory.ca, as well as an editor of the Programming Historian 2 website. He has written several articles on Canadian youth, digitization, and labour history.

Endnotes

1. The question of when the past becomes a “history” to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996: Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic works on the 1960s now suggest, the past becomes history sooner than we often expect.

2. A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3. See “IdleNoMore Timeline,” ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4. These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on “Collecting History Online.” Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, “History and the Second Decade of the Web,” Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5. Figure taken from Jose Martinez, “Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day,” ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the users’ following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, “This is What a Tweet Looks Like,” ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6. This is further discussed in Ian Foster, “How Computation Changes Research,” in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7. Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8. While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, “History and the Social Sciences: The Longue Durée,” in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9. A great overview can be found at “Text Analysis,” Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10. Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science 331 (14 January 2011): 176–82. The tool is publicly accessible at “Google Books Ngram Viewer,” Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11. Anthony Grafton, “Loneliness and Freedom,” Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12. Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, “Loneliness and Freedom,” Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13. For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.


14. “With Criminal Intent,” CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15. For an overview, see “Award Recipients – 2011,” Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16. One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to learn the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17. Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, “Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective,” Canadian Historical Review 61, no. 3 (1980): 305–33.

18. Ian Anderson, “History and Computing,” Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19. Christopher Collins, “Text and Document Visualization,” The Humanities and Technology Camp Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20. The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age (Google eBook: Penguin, 2010). For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, a Theory, a Flood (New York: Pantheon, 2011), 404.

21. Adam Ostrow, “Cisco CRS-3: Download Library of Congress in Just Over One Second,” Mashable (9 March 2010), accessed online: www.mashable.com/2010/03/09/cisco-crs-3.

22. A note on data size is necessary. The basic unit of information is a “byte” (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23. These figures can of course vary depending on the compression methods chosen. Their math was based on an average of 8MB per book, if it was scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, “Transferring ‘Libraries of Congress’ of Data,” Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24. Gleick, 232.

25. “About the Library of Congress Web Archives,” Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26. “10,000,000,000,000,000 Bytes Archived,” Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27. Jean-Baptiste Michel et al., 176–82; published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, “Google’s Piracy Liability,” Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28. “Government of Canada Web Archive,” Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29. For background, see Bill Curry, “Visiting Library and Archives in Ottawa? Not Without an Appointment,” Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement “LAC Begins Implementation of New Approach to Service Delivery,” Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the “CHA’s Comment on New Directions for Library and Archives Canada,” CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and “Save Library & Archives Canada,” Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30. Ian Milligan, “The Smokescreen of ‘Modernization’ at Library and Archives Canada,” ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31. “Save LAC May 2012 Campaign Update,” SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32. The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, “The Future of Preserving the Past,” CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see “September 11 Digital Archive: Saving the Histories of September 11, 2001,” Centre for History and New Media and American Social History Project / Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; “Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita,” Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and “Occupy Archive: Archiving the Occupy Movements from 2011,” Centre for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33. Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on “Preserving Digital History.” The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34. For an overview, see “Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress,” Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35. Steve Meloan, “No Way to Run a Culture,” Wired, 13 February 1998, available online via the Internet Archive: www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36. Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, “WARC Files: A Challenge for Historians, and Finding Needles in Haystacks,” ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37. Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so that one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38. See William Noblett, “Digitization: A Cautionary Tale,” New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, “Ocropodium: Open Source OCR for Small-Scale Historical Archives,” Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, “Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac: The Times Digital Archive: The Case of Two Million versus Two Millions,” Journal of English Linguistics 32, no. 2 (June 2004): 127.

39. Elizabeth Krug, “Canada’s Digital Collections: Sharing the Canadian Identity on the Internet,” The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40. “WayBackMachine Beta: www.collections.ic.gc.ca,” Internet Archive, available online: www.wayback.archive.org/web/*/http://collections.ic.gc.ca <viewed 11 January 2013>.

41. “Street Youth Go Online,” News Release from Industry Canada, Internet Archive, 15 November 1996, available online: www.web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42. Darlene Fichter, “Projects for Libraries and Archives,” Canada’s Digital Collection Webpage, 24 November 2000, through Internet Archive: www.web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43. “Canada’s Digital Collections homepage,” 21 November 2003, through Internet Archive: www.web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44. “Important news about Canada’s Digital Collections,” Canada’s Digital Collections Webpage, 18 November 2006, through Internet Archive: www.web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45. For more information, see “Wolfram Mathematica 9,” Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel’s demonstration of “Term Weighting with TF-IDF,” drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46. It was 8 degrees Celsius at 5:00 pm on 18 December 1983; “Edith” was more popular than “Emma” until around 1910, and “Emma” is now the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47. It was formerly known as “Voyeur Tools” before a 2011 name change. It is available online at www.voyant-tools.org.


48. For more information, see “Software Environment for the Advancement of Scholarly Research (SEASR),” www.seasr.org <viewed 18 June 2012>.

49. The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see “Topic Modeling,” MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50. For a good overview, see “Analytics: Dunning’s Log-Likelihood Statistic,” MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51. “Important Notices,” Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. These can usually be found on index or splash pages, under headings as varied as “Terms of Service,” “Permissions,” “Important Notices,” or others.

52. The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie, as there is a “splash redirect” page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53. For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54. David M. Blei, “Introduction to Probabilistic Topic Models,” www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55. The research was actually carried out on a rather dated, late-2007 MacBook Pro.

56. See John Bonnett, “High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada,” Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, “High Performance Computing in the Arts and Humanities,” SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57. “SHARCNET: History of SHARCNET,” SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58. “Priority Areas,” SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59. Eloquently discussed in Adam Crymble, “Is Digital Humanities a Field of Research?” Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.



The information produced and consumed by humankind used to vanish — that was the norm, the default. The sights, the sounds, the songs, the spoken word just melted away. Marks on stone, parchment, and paper were the special case. It did not occur to Sophocles’ audiences that it would be sad for his plays to be lost; they enjoyed the show. Now expectations have inverted. Everything may be recorded and preserved, at least potentially.

James Gleick, The Information: A History, a Theory, a Flood, 396–7

Almost every day, almost every Canadian generates born-digital information that is fast becoming a sea of data. Historians must learn how to navigate this sea of digital material. Where will the historian of the future go to research today’s history? What types of sources will they use? We are currently experiencing a revolutionary medium shift. As the price of digital storage plummets and communications are increasingly carried through digitized text, pictures, and video, the primary sources left for future historians will require a significantly expanded and rethought toolkit. The need to manage a plethora of born-digital sources (those originally digitally created) as an essential part of history will be on us soon. While there is no commonly accepted rule-of-thumb for when a topic becomes “history,” it is worth noting as an example that it took less than 30 years after the tumultuous year of 1968 for a varied, developed, and contentious North American historiography to appear on the topic of life in the 1960s.1 In 2021, it will have been a similar 30 years since, in August 1991, Tim Berners-Lee, a fellow at the European Organization for Nuclear Research (CERN), published the very first website and launched the World Wide Web. Professional historians need to be ready to add and develop new skills to deal with sources born in a digital age.

Digital sources necessitate a rethinking of the historian’s toolkit. Basic training for new historians requires familiarity with new methodological tools, and making resources for the acquisition of these tools available to their mentors and teachers. Not only will these tools help in dealing with more recent born-digital sources, but they will also help historians capitalize on digitized archival sources from our more-distant past. This form of history will not replace earlier methodologies, but instead play a meaningful, collaborative, and supportive role. It will be a form of distant reading, encompassing thousands or tens of thousands of sources, that will complement the more traditional and critically important close reading of small batches of sources that characterizes so much of our work. Indeed, much of what we think of as “digital history” may simply be “history” in the years to come.2

There are a number of avenues along which this transformation will take place. First, it is important to establish a map that situates this emerging field into what has come before. Second, I examine the current situation of born-digital sources and historical practice. Third, I use an early born-digital archive from Library and Archives Canada (LAC), Canada’s Digital Collections project (CDC), to evaluate how we access and make sense of digital information today. I then apply emergent visualization techniques to the sources in this archive to show one way through which historians can tackle born-digital source issues. Lastly, I will lay out a road map to ensure that the next generation has enough basic digital literacy to adequately analyze this new form of source material.

Imagine a future historian taking on a central question of social and cultural life in the middle of the second decade of the twenty-first century, such as how Canadians understood Idle No More or the Occupy movement through social media. What would her archive look like? I like to imagine it as boxes stretching out into the distance, tapering off without any immediate reference points to bring it into perspective. Many of these digital archives will be without human-generated finding aids, would have perhaps no dividers or headings within “files,” and certainly no archivist with a comprehensive grasp of the collection contents. A recent example is illustrative. During the IdleNoMore protests, Twitter witnessed an astounding 55,334 tweets on 11 January 2013. Each tweet can be up to 140 characters. To put that into perspective, this article is less than 900 lines long in a word processor. Yet this massive amount of information looks cryptic: strange usernames, no archives, no folders, not even an immediate way to see whether a given tweet is relevant to your research or not.3 This is a vast collection. Information overload.

This example is indicative of the issues facing the historical profession.4 I am not exaggerating: every day, half a billion tweets are sent;5 hundreds of thousands of comments are uploaded to news sites; and scores of blog posts are uploaded. This makes up one small aspect of broader data sets: automated logs, including climate readings, security access reports, library activity, search patterns, books and movies ordered from Amazon.ca or Netflix.6 Even if only a small fraction of this is available to future historians, it will represent an unparalleled source of information about how life was lived (or thought to have been lived) during the early twenty-first century. To provide another, perhaps more tangible, example: while carrying out my previous work on youth cultures in 1960s English Canada, I was struck by how inaccessible television sources were compared to newspaper records, despite television being arguably the primary media of the time. Unlike the 1960s, however, the Internet leaves a plethora of available, if difficult to access, primary sources for historians. In the digital age, archiving will be different, which means that the history we write will be different.

We must begin planning for this plethora of born-digital sources now. Using modern technology, we have the ability to quickly find needles in digital haystacks, to look inside a ‘box’ and immediately find that relevant letter rather than spending hours searching and reading everything. Yet we also have the ability to pull our gaze back, distantly elucidating the context of our relevant documents as well.

The Necessity of Distant Reading: Historians and Computational History

Franco Moretti, in his 2005 groundbreaking work of literary criticism, Graphs, Maps, Trees, called on his colleagues to rethink their craft. With the advent of mass-produced novels in nineteenth-century Britain, Moretti argued that scholars must change their disciplinary approach. Traditionally, literary scholars worked with a corpus of around 200 novels; though an impressive number for the average person, it represents only around one percent of the total output of that period. To work with all of this material in an effort to understand these novels as a genre, critics needed to approach research questions in a different way:

[C]lose reading won’t help here, a novel a day every day of the year would take a century or so … And it’s not even a matter of time, but of method: a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn’t a sum of individual cases: it’s a collective system, that should be grasped as such, as a whole.7

Instead of close reading, Moretti called for literary theorists to practice “distant reading.”

For inspiration, Moretti drew on the work of French Annales historian Fernand Braudel. Braudel, amongst the most instrumental historians of the twentieth century, pioneered a similarly distant approach to history: his inquiries spanned large expanses of time and space, that of civilizations, the Mediterranean world, as well as smaller (only by comparison) regional histories of Italy and France. His distant approach to history did not necessitate disengagement with the human actors on the ground, seeing instead that beyond the narrow focus of individual events lay the constant ebb and flow of poverty and other endemic structural features.8 While his methodology does not apply itself well to rapidly changing societies (his focus was on the long-term, slow change), his points around distant reading and the need for collaborative, interdisciplinary research are a useful antecedent. Similar opportunities present themselves when dealing with the vast array of born-digital sources (as well as digitized historical sources): we can distantly read large arrays of sources from a longue durée of time and space, mediated through algorithms and computer interfaces, opening up the ability to consider the voices of everybody preserved in the source.

Historians must be open to the digital turn, due to the astounding growth of digital sources and an increasing technical ability to process them on a mass scale. Social sciences and humanities researchers have begun to witness a profound transformation in how they conduct and disseminate their work. Indeed, historical digital sources have reached a scale where they defy conventional analysis and now require computational analysis. Responding to this, archives are increasingly committed to preserving cultural heritage materials in digital forms.

I would like to pre-emptively respond to two criticisms of this digital shift within the historical profession. First, historians will not all have to become programmers. Just as not all historians need a firm grasp of Geographical Information Systems (GIS), a developed understanding of the methodological implications of community-based oral history, the incorporation of a transnational perspective, or in-depth engagement with cutting-edge demographic models, not all historians have to approach their trade from a computational perspective. Nor should they. Digital history does not replace close reading, traditional archival inquiry, or going into communities (to use only a few examples) to uncover notions of collective memory or trauma. Indeed, digital historians will play a facilitative role and provide a broader reading context. Yet they will still be historians: collecting relevant sources, analyzing and contextualizing them, situating them in convincing narratives or explanatory frameworks, and disseminating their findings to wider audiences.

Neither will historians replace professional programmers. Instead, as with other subfields, historians using digital sources will need to be prepared to work on larger, truly interdisciplinary teams. Computer scientists, statisticians, and mathematicians, for example, have critical methodological and analytical contributions to make towards questions of historical inquiry. Historians need to forge further links with these disciplines. Such meaningful collaboration, however, will require historians who can speak their languages and who have an understanding of their work. Some of us must be able to discuss computational or algorithmic methodologies when we collectively explore the past.

The methodologies discussed in this paper fall into the twin fields of data mining and textual analysis. The former refers to the discovery of patterns or information about large amounts of information, while the latter refers to the use of digital tools to analyze large quantities of text.9 The case study here draws upon three main concepts: information retrieval (quickly finding what you want), information and term frequency (how often various concepts and terms appear), and topic modeling (finding frequently occurring bags of words throughout documents). Other projects have involved authorship attribution (finding the author of documents based on previous patterns), style analysis, and creating network graphs of topics, groups, documents, or individuals.
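To make the first of these concepts concrete, an information-retrieval pass over a batch of files takes only a few lines of Mathematica, the platform used in the case study below. This is a minimal sketch under stated assumptions: the folder cdc-mirror, the *.html file pattern, and the keyword “fishery” are hypothetical placeholders rather than details drawn from the CDC collection:

    (* Find which files in a folder mention a search term, and show a    *)
    (* 60-character keyword-in-context window around each occurrence.    *)
    files = FileNames["*.html", "cdc-mirror"];
    getText[file_] := ToLowerCase[Import[file, "Plaintext"]];
    hits = Select[files, ! StringFreeQ[getText[#], "fishery"] &];
    kwic[file_, term_] := Module[{text = getText[file], pos},
       pos = StringPosition[text, term];
       StringTake[text,
          {Max[First[#] - 30, 1], Min[Last[#] + 30, StringLength[text]]}] & /@ pos];
    kwic[First[hits], "fishery"]  (* context for the first matching file *)

Run across thousands of files, a listing like this already functions as a crude, automatically generated finding aid.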

The Culturomics project, run by the Cultural Observatory at Harvard University, is a notable model of collaboration on a big history project. As laid out in Science and put online as the Google Books n-gram viewer (www.books.google.com/ngrams), this project indexed word and phrase frequency across over five million books, enabling researchers to trace the rise and fall of cultural ideas and phenomena through targeted keyword and phrase searches.10 This is the most ambitious, and certainly the most advanced and accessible, Big History project yet launched. Yet amongst the extensive author list (13 individuals and the Google Books team) there were no historians. This suggests that other disciplines in the humanities and social sciences have embraced the digital turn far more than historians. However, as I mentioned, in the coming decades historians will increasingly rely on digital archives. There is a risk that history as a professional discipline could be left behind.

American Historical Association President Anthony Grafton noted this absence in a widely circulated Perspectives article.11 Why was there no historian, he asked, when this was both primarily a historical project and a team comprised of varied doctoral holders from disciplines such as English literature, psychology, computer science, biology, and mathematics? Culturomics project leaders Erez Lieberman Aiden and Jean-Baptiste Michel responded frankly. While the team approached historians, and some did play advisory roles, every author listed “directly contributed to either the creation of the collection of written texts (the ‘corpus’) or to the design and execution of the specific analyses we performed. No academic historians met this bar.” They continued to indict the profession:

The historians who came to the meeting were intelligent, kind, and encouraging. But they didn’t seem to have a good sense of how to wield quantitative data to answer questions, didn’t have relevant computational skills, and didn’t seem to have the time to dedicate to a big multi-author collaboration. It’s not their fault: these things don’t appear to be taught or encouraged in history departments right now.12

This is a serious indictment, considering the pending archival shift to digital sources, and one that still largely holds true (a few history departments now offer digital history courses, but they are still outliers). This form of history, the massive analysis of digitized information, will continue. Furthermore, these approaches will attract public attention, as they lend themselves well to accessible online deployment. Unless historians are well positioned, they risk ceding professional ground to other disciplines.

While historians were not involved in Culturomics, they are, however, involved in several other digital projects. Two major and ambitious long-term Canadian projects are currently using computational methods to reconstruct the demographic past. The first is the Canadian Century Research Infrastructure project, which is creating databases from five censuses and aiming to provide “a new foundation for the study of social, economic, cultural, and political change,” while the second is the Programme de recherche en démographie historique, hosted by the Université de Montréal.13 They are reconstructing the European population of Quebec in the seventeenth and eighteenth centuries, drawing heavily on parish registers. Furthermore, several projects in England have harnessed the wave of interest in “big data” to tell compelling historical stories about their pasts: the Criminal Intent project, for example, used the proceedings of the Old Bailey courts (the main criminal court in Greater London) to visualize large quantities of historical information.14 The Criminal Intent project was itself funded by the Digging into Data Challenge, which brings together multinational research teams and funding from federal-level funding agencies in Canada, the United States, the United Kingdom, and beyond. Several other historical studies have been funded under their auspices, from studies of commodity trade in the eighteenth- and nineteenth-century Atlantic world (involving scholars at the University of Edinburgh, the University of St Andrews, and York University) to data mining the degrees of economic opportunity and spatial mobility in Britain, Canada, and the United States (bringing researchers at the University of Guelph, the University of Leicester, and the University of Minnesota together).15

These exciting interdisciplinary teams may be heralding a disciplinary shift, although it is still early days.

Collaborative teams have limitations, though. We will have to be prepared to tackle digital sources alone, on shoestring budgets. Historians will not always be able to call on external programmers or specialists to do work, because teams often develop out of funding grants. As governmental and university budgets often prove unable to meet consistent research needs, baseline technical knowledge is required for independent and unfunded work. If historians rely exclusively on outside help to deal with born-digital sources, the ability to do computational history diminishes. As the success of the Programming Historian indicates, and as I will demonstrate in my case study below, historians can learn basic techniques.16 Let us tweak our own algorithms.

A social or cultural history of the very late-twentieth or twenty-first century will have to account for born-digital sources, and historians will (in conjunction with archival professionals) have to curate and make sense of the data. Historians need to be able to engage with these sources themselves, or at least as part of a team, rather than having sources mediated through the lens of outside disciplines or commercial interests. Algorithms may be one meaningful way to make sense of cultural trends or extract meaningful information from billions of tweets, but we must ensure professional historical input goes into their crafting.

The Third Wave of Computational History

Historians have fruitfully used computers in the past, as part of two previous waves. The 1960s and 1970s saw computational tools used for demographic, population, and economic histories, a trend that in Canada saw full realization in comprehensive studies such as Michael Katz’s The People of Hamilton, Canada West.17 As Ian Anderson has convincingly noted in a review of history and computing, studies of this sort saw the field become associated with quantitative studies.18

After a retreat from the spotlight, computational history rose again in the 1990s with the advent of the personal computer, graphical user interfaces, and overall improvements in ease of use. Punch cards could now give way to database programs, GIS, and early online networks such as H-Net and Usenet. We are now, I believe, on the cusp of a third revolution in computational history, thanks to three main factors: decreasing storage costs, the power of the Internet and distributed cloud computing, and the rise of professionals dealing with both digital preservation and open-source tools.

Decreasing storage costs have led to massive amounts of information being preserved, as Gleick noted in this article’s epigraph. It is important to note that this information overload is not new: people have long worried about the impact of too much information.19 In the sixteenth century, responding to the rise of the printing press, the German priest Martin Luther decried that the “multitude of books [were] a great evil.” In the nineteenth century, Edgar Allan Poe bemoaned that “[t]he enormous multiplication of books in every branch of knowledge is one of the greatest evils of this age.” As recently as 1970, American historian Lewis Mumford lamented that “the overproduction of books will bring about a state of intellectual enervation and depletion hardly to be distinguished from massive ignorance.”20 The rise of born-digital sources must thus be seen in this continuous context of handwringing around the expansion and rise of information.

Useful historical information is being preserved at a rate that is accelerating with every passing day. One useful metric to compare dataset size is that of the American Library of Congress (LOC); indeed, a “LOC” has become shorthand for a unit of measurement in the digital humanities. Measuring the actual extent of the printed collection in terms of data is notoriously difficult. James Gleick has claimed it to be around 10TB,22 whereas the Library of Congress itself claims a more accurate figure is about 200TB.23 While these two figures are obviously divergent, neither is on an unimaginable scale: personal computers now often ship with a terabyte or more of storage. When Claude Shannon, the father of information theory, carried out a thought experiment in 1949 about items or places that might store “information” (an emerging field which viewed messages in their aggregate), he scaled sites from a digit upwards to the largest collection he could imagine. The list scaled up logarithmically, from 10⁰ to 10¹⁴: if a single-spaced page of typing is 10⁴, 10⁷ the Proceedings of the Institute of Radio Engineers, and 10⁹ the entire text of the Encyclopaedia Britannica, then 10¹⁴ was “the largest information stockpile he could think of: the Library of Congress.”24

But now the Library of Congress’ print collection, something that has taken two centuries to gather, is dwarfed by their newest collection of archived born-digital sources, the vast majority only a matter of years old compared to the much wider date-range of non-digital sources in the traditional collection. The LOC has begun collecting its own born-digital Internet archive: multiple snapshots are taken of webpages to show how they change over time. Even as a selective, curated collection drawing on governmental, political, educational, and creative websites, the LOC has already collected 254TB of data, and adds 5TB a month.25 The Internet Archive, through its Wayback Machine, is even more ambitious: it seeks to archive every website on the Internet. While its size is also hard to put an exact finger on, as of late 2012 it had over 10 petabytes of information (or, to put it into perspective, a little over 50 Libraries of Congress, if we take the 200TB figure).26


This is to some degree comparing apples and oranges: the Library of Congress is predominantly print, whereas the Internet Archive has considerable multimedia holdings, which include videos, images, and audio files. The LOC is furthermore a more curated collection, whereas the Internet Archive draws from a wider range of producers. The Internet offers the advantages and disadvantages of being a more democratic archive. For example, a GeoCities site (an early way for anybody to create their own website) created by a 12-year-old Canadian teenager in the mid-1990s might be preserved by the Internet Archive, whereas it almost certainly would not be saved by a national archive. This difference in size and collections management, however, is at the root of the changing historians’ toolkit. If we make use of these large data troves, we can access a new and broad range of historical subjects.

Conventional sources are also part of this deluge. Google Books has been digitizing the printed word, and it currently has 15 million works completed; by 2020, it audaciously aims to digitize every book ever written. While copyright issues loom, the project will allow researchers to access aggregate word and phrase data, as seen in the Google n-gram project.27 In Canada, Library and Archives Canada (LAC) has a more circumscribed digitization project, collecting “a representative sample of Canadian websites,” focusing particularly on Government of Canada webpages, as well as a curated approach preserving case studies such as the 2006 federal election. Currently, it has 4TB of federal government information and 7TB of total information in its larger web archive (a figure which, as I note below, is unlikely to drastically improve despite ostensible goals of archival ‘modernization’).28

In Canada, LAC has, on the surface, been advancing an ambitious digitization program, as it realizes the challenges now facing archivists and historians. Daniel J. Caron, the Librarian and Archivist of Canada, has been outspoken on this front, with several public addresses on the question of digital sources and archives. Throughout 2012, primary source digitization at LAC preoccupied critics, who saw it as a means to depreciate on-site access. The Canadian Association of University Teachers, the Canadian Historical Association, and other professional organizations mounted campaigns against this shift in resources.29 LAC does have a point, however, with their modernization agenda: the state of Canada’s online collections is small and sorely lacking when compared to their on-site collections, and LAC does need to modernize and achieve the laudable goal of reaching audiences beyond Ottawa.

Nonetheless, digitization has unfortunately been used as a smokescreen to depreciate overall service offerings.30 The modernization agenda would also see 50 percent of the digitization staff cut.31 LAC’s online collections are small; they do not work with online developers through Application Programming Interfaces (APIs, a way for a computer program to talk to another computer and speed up research; Canadiana.ca is incidentally a leader in this area); and there have been no promising indications that this state of affairs will change. Digitization has not been comprehensive. Online preservation is important, and historians must fight for both on-site and on-line access.
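To illustrate what API access means in practice, consider a minimal sketch in Mathematica: a JSON response from a web service can be imported directly as structured data. The endpoint and query parameters below are hypothetical placeholders, not the actual interface of Canadiana.ca or LAC:

    (* Query a hypothetical JSON search API; the response arrives as     *)
    (* nested lists of rules, ready for further computational analysis.  *)
    results = Import[
       "http://api.example.org/search?q=fisheries&format=json", "JSON"];

An archive that exposes such an interface lets researchers script thousands of queries rather than clicking through a search form, which is precisely what large-scale distant reading requires.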

A number of historians have recognized this challenge and are playing an instrumental role in preserving the born-digital past. The Roy Rosenzweig Center for History and New Media at George Mason University launched several high-profile preservation projects: the September 11th Digital Archive, the Hurricane Digital Memory Bank (focusing on Hurricanes Katrina and Rita), and, as of writing, the Occupy archive.32 Massive amounts of online content are curated and preserved: photographs, news reports, blog posts, and now tweets. These complement more traditional efforts of collecting and preserving oral histories and personal recollections, which are then geo-tagged (tied to a particular geographical location), transcribed, and placed online. These archives serve a dual purpose: preserving the past for future generations while also facilitating easy dissemination for today’s audiences.

Digital preservation, however, is a topic of critical importance. Much of our early digital heritage has been lost. Many websites, especially those before the Internet Archive’s 1996 web archiving project launch, are completely gone. Early storage mediums (from tape to early floppy disks) have become obsolete and are now nearly inaccessible. Compounding this, software development has seen early proprietary file formats depreciated, ignored, and eventually made wholly inaccessible.33 Fortunately, the problem of digital preservation was recognized by the 1990s, and in 2000 the American government established the National Digital Information Infrastructure and Preservation Program.34 As early as 1998, digital preservationists argued, “[h]istorians will look back on this era … and see a period of very little information. A ‘digital gap’ will span from the beginning of the wide-spread use of the computer until the time we eventually solve this problem. What we’re all trying to do is to shorten that gap.” To bear home their successes, that quotation is taken from an Internet Archive snapshot of Wired magazine’s website.35

Indeed, greater attention towards rigorous metadata (which is, briefly put, data about data, often preserved in plain text format), open-access file formats documented by the International Organization for Standardization (ISO), and dedicated digital preservation librarians and archivists gives us hope for the future.36 It is my belief that cloud storage, the online hosting of data in large third-party offsite centres, mitigates the issue of obsolete storage mediums. Furthermore, the move towards open-source file formats, even amongst large commercial enterprises such as Microsoft, provides greater documentation and uniformity.


A Brief Case Study: Accessing and Navigating Canada’s Digital Collections, With an Emphasis on Tools for Born-Digital Sources

What can we do about this digital deluge? There are no simple answers, but historians must begin to conceptualize new additions to their traditional research and pedagogical toolkits. In the section that follows, I will do two things: introduce a case study, and introduce several tools, both proprietary and openly available off-the-shelf, that reveal one way forward.

For a case study, I have elected to use Canada’s Digital Collections (CDC), a collection of ‘dead’ websites removed from their original location on the World Wide Web and now digitally archived at LAC. In this section, I will introduce the collection using digitally-archived sources from the early World Wide Web, and discuss my research methodology, before illustrating how we can derive meaningful data from massive arrays of unstructured data.37

This is a case study of a web archive. Several of these methodologies could be applied to other formats, including large arrays of textual material, whether born-digital or not. A distant reading of this type is not medium specific. It does, however, assume that the text is high quality. Digitized primary documents do not always have text layers, and if they are instituted computationally through Optical Character Recognition, errors can appear. Accuracy rates for digitized textual documents can range from as low as 40 percent for seventeenth- and eighteenth-century documents, to 80–90 percent for modern microfilmed newspapers, to over 99 percent for typeset, word-processed documents.38 That said, while historians today may be more interested in applying distant reading methodologies to more traditional primary source materials, it is my belief that in the not-so-distant future websites will be an increasingly significant source for social, cultural, and political historians.

What are Canada’s Digital Collections? In 1996, Industry Canada, funded by the federal Youth Employment Strategy, facilitated the development of Canadian heritage websites by youth (defined as anyone between the ages of 15 and 30). It issued contracts to private corporations that agreed to use these individuals as web developers, for the twofold purpose of developing digital skills and instilling respect for and engagement with Canadian heritage and preservation.39 The collection, initially called SchoolNet but rebranded as Canada’s Digital Collections in 1999, was accessible from the CDC homepage. After 2004, the program was shut down, and the vast majority of the websites were archived on LAC’s webpage. This collection is now the largest born-digital Internet archive that LAC holds, worthy of study for its content as well as for thinking about methodological approaches.

Conventional records about the program are unavailable in the traditional LAC collection, unsurprising in light of both its recent creation and backlogs in processing and accessioning archival materials. We can use one element of our born-digital toolkit, the Internet Archive’s WaybackMachine (its archive of preserved early webpages), to get a sense of what this project looked like in its heyday. The main webpage for the CDC was www.collections.ic.gc.ca. However, a visitor today will see nothing but a collection of broken image links (two of them will bring you to the English and French-language versions of the LAC-hosted web archive, although this is only discernible through trial and error or through the site’s source code).

With the WaybackMachine, we have a powerful, albeit limited, opportunity to “go back in time” and experience the website as previous web viewers would have (if a preserved website is available, of course). Among the over 420 million archived websites, we can search for and find the original CDC site. Between 12 October 1999 and the most recent snapshot, or web crawl, of the now-broken site on 6 July 2011, the site was crawled 410 times.40 Unfortunately, every site version until the 15 October 2000 crawl has been somehow corrupted. Content has been instead replaced by Chinese characters, which translate only into place names and other random words. Issues of early digital preservation noted above have reared their head in this collection.

From late 2000 onward, however, fully preserved CDC sites are accessible. Featured sites, new entries, success stories, awards, curricular units, and prominent invitations to apply for project funding populate the brightly coloured website. The excitement is palpable. John Manley, then Industry Minister, declared that “it is crucial that Canada’s young people get the skills and experience they need to participate in the knowledge-based economy of the future,” and that CDC “shows that young people facing significant employment challenges can contribute to and learn from the resources available on the Information Highway.”41 The documentation, reflecting the lower level of technical expertise and scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as well as preservation issues and hardware requirements for computers (a computer with an Intel 386 processor, 8MB of RAM, a 1GB hard drive, and a 28.8kbps modem was recommended), along with training information for employers, educators, and participants.42 Sample proposals and community liaison ideas, amidst other information, set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites offers an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how viewers discovered the site and whether they think it is a good educational resource (the administrators must have been relieved that the majority thought so: in 2003, some 84 percent felt that it was a "good" resource, the best of its type).43 By 2004 the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, alphabetical listing, success stories, and detailed copyright information.

The Canada's Digital Collections site began to move towards removal in November 2006, when it initially disappeared completely. A visitor to the site would learn that the content "is no longer available," and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007 a notice appeared declaring that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained, then and today for that matter, is a straightforward alphabetical list and a disclaimer noting that "[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost."

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use to navigate large amounts of data. First, I create my own tools through computer programming; this work is showcased in the following section. I program in Wolfram Research's Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base, which complements exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website in perfectly readable plaintext (avoiding Unicode errors), as follows: input = Import["http://activehistory.ca", "Plaintext"], and a second can make it all lower case for textual analysis, as such: lowerpage = ToLowerCase[input]. Variables are a critical building block. In the former example, input now holds the entire plaintext of the website ActiveHistory.ca, and then lowerpage holds it in lower case. Once we have lower-cased information, we can quickly extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
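As a minimal sketch of those next steps (assuming the variables above; this is illustrative, not the article's own code), a ranked word-frequency list takes only a few more lines:

(* split the lower-cased text into words on runs of non-word characters *)
words = StringSplit[lowerpage, Except[WordCharacter] ..];
(* tally each word and sort, most frequent first *)
wordFreq = Reverse[SortBy[Tally[words], Last]];
Take[wordFreq, 20]

The wordFreq list built here is reused in the sketches that follow.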

Even more promising is Mathematica's integration with the Wolfram|Alpha database (itself accessible at wolframalpha.com), a computational knowledge engine that promises to put incredible amounts of information at a user's fingertips. With a quick command, one can access the temperature at Toronto's Lester B. Pearson International Airport at 5:00 pm on 18 December 1983, the relative popularity of names such as "Edith" or "Emma," the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information or find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.
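Within Mathematica, such a lookup is a single function call. A minimal sketch (the query strings are illustrative, and the returned values depend on what Wolfram|Alpha curates):

WolframAlpha["Canada unemployment rate March 1985", "Result"]
WolframAlpha["Toronto temperature 5pm 18 December 1983", "Result"]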

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be-all and end-all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools.47 It is a powerful suite of textual analysis tools. Uploading one's own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a "word cloud" of a corpus; the ability to click on any word to see its relative rise and fall over time (word trends); access to the original text; and the ability to click on any word to see its contextual placement within individual sentences. It ably facilitates the "distant" reading of any reasonably-sized body of textual information and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis tool rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. There are many visualization options provided: you can plot relationships between data, compare values (i.e., bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found in the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or, in a best-case scenario, hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" that appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2 or on my own blog at ianmilligan.ca. I have elected to leave those discussions online rather than provide full commands in this article: they require some technical knowledge, and working directly from the web is quicker than working from here. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian dealing with conventional sources often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command-line entry brought the entire archive onto my system; it took days to process the information, and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, amounts to approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay, or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations: the information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. The sheer size of the database alone requires big data methodologies: an aggregate size of 7.26 GB, with 360 MB of plaintext with HTML encoding, reduced to a 'mere' 50 MB if all encoding is removed. A 50 MB plain-text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the Internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with the material. Before you do this, though, you need to be familiar with the website's "terms of service," the legal documentation that outlines how users are allowed to use a website, its resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced, in part or in whole and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so: the index is located at epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in a folder such as …/cdc/AB/benevolat, the next in …/cdc/ab/nature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system, in this case one noting that you had seen the disclaimer about the archival nature of the website) and then, with a one-line command in my command line, to download the entire database.52 Due to the imposition of a random wait and bandwidth limit, which meant that I would download one file every half to one and a half seconds, at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.
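As a sketch, the invocation took roughly the following shape (the full command, with the paths actually used, is reproduced in note 52): mirroring, cookie handling for the archival disclaimer page, a random wait, and a 30 KB/s bandwidth cap:

wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt \
     --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm \
     --random-wait --wait=1 --limit-rate=30K \
     http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp

The wait and rate-limit flags are what stretch the download over days; they are also what keep the crawl polite to the hosting archive.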

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these connections must often be inferred online. HTML born-digital sources are different in that the link offers a very concrete file.

Visualizing the CDC Collection as a Series of Hyperlinks

One quick visualization, the circular cluster reproduced here, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As we see in the figure, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The visualization is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and the relationships between different sources. At a glimpse, we can see suggestions to follow other sites and determine if any nodes are shared amongst the various websites revealed.
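A minimal sketch of how such a link graph can be assembled in Mathematica: Import can return a page's outbound hyperlinks directly, and Graph with tooltip labels reproduces the mouse-over behaviour described above. The seed URL is illustrative, and a real crawl would repeat the scrape for each discovered page up to the two-link depth:

seed = "http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp";
links = Import[seed, "Hyperlinks"];   (* every outbound link on the page *)
edges = Map[(seed -> #) &, links];    (* one directed edge per link *)
Graph[edges, VertexLabels -> Placed["Name", Tooltip]]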

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian primarily interested in the written word, my first step was to isolate the plaintext throughout the material, so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels: first, the aggregate collection, and second, the collection divided into each 'file,' or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words appear within the nearly eight million words of the collection? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436 times, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader, and it is of minimal visual attractiveness. Two quick decisions were made to enhance data usability. First, the information was "case-folded," all sent to lower case; this enables us to count the total appearances of a word such as "digital" regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency (as opposed to phrase frequency), "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is comprised entirely of stop words) but will quickly dominate any word-level visualization or list.53 Other words may need to be added to or removed from stop lists depending on the corpus: for example, if the phrase "page number" appeared on every page, that would be useless information for the researcher.

The first step, then, is to use Mathematica to format word frequency information into a Wordle.net-readable format (if we put all of the text into that website, it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds. What can we learn from this illustration?

A Word Cloud of All Words in the CDC Corpus

First, without knowing anything about the sites, we can immediately learn that the collection deals with Canada and Canadians; that it holds documents in both French and English; that we are dealing with digitized collections; and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.
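The formatting step itself can be done by exporting "word:count" lines, which Wordle's advanced (weighted-list) input accepts. A minimal sketch, reusing the wordFreq list from the earlier example (the file name is arbitrary, and the site's size limits still apply):

(* write the top 150 words as "word:count" lines *)
wordleText = StringJoin[Riffle[
    Map[#[[1]] <> ":" <> ToString[#[[2]]] &, Take[wordFreq, 150]], "\n"]];
Export["wordle-input.txt", wordleText, "Text"]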

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, the word cloud leaves a considerable amount of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or to educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but they need to be used with extreme caution. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present." Despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically use any higher number. A bigram can be a combination of two characters (such as 'bi'), two syllables, or, pertinent in our case, two words; an example is "canada is." A trigram is three words, a quadgram is four words, a fivegram is five words, and so on.
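A minimal sketch of that extraction step, assuming the words list from the earlier example (illustrative, not the article's own program): overlapping word-level n-grams fall out of Partition, and can be tallied just like single words:

(* overlapping word-level n-grams: Partition with offset 1 *)
ngramTally[w_List, n_Integer] := Reverse[SortBy[Tally[Partition[w, n, 1]], Last]];
trigrams = ngramTally[words, 3];
Take[trigrams, 10]   (* the ten most frequent trigrams *)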

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Word Cloud of the File "Past and Present"

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could refer to almost anything. Initially, I decided to study word frequency:

"alberta" 758, "listen" 490, "settlers" 467, "new" 454, "canada" 388, "land" 362, "read" 328, "people" 295, "settlement" 276, "chinese" 268, "canadian" 264, "place" 255, "heritage" 254, "edmonton" 250, "years" 238, "west" 236, "raymond" 218, "ukrainian" 216, "came" 213, "calgary" 209, "names" 206, "ranch" 204, "updated" 193, "history" 190, "communities" 183 …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example in this automatically generated output of top phrases:

"first" "people" "settlers" 100, "the" "united" "states" 72, "one" "of" "the" 59, "of" "the" "west" 58, "settlers" "last" "updated" 56, "people" "settlers" "last" 56, "albertans" "first" "people" 56, "adventurous" "albertans" "first" 56, "the" "bar" "u" 55, "opening" "of" "the" 54, "communities" "adventurous" "albertans" 54, "new" "communities" "adventurous" 54, "bar" "u" "ranch" 53, "listen" "to" "the" 52 …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to quickly hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.
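That slider-driven toggling can be approximated in a few lines with Manipulate; a hedged sketch using the ngramTally helper defined above (the actual viewer has more functionality than this):

Manipulate[
 TableForm[Take[ngramTally[words, n], rows]],
 {{n, 3, "n-gram length"}, 2, 5, 1},
 {{rows, 15, "rows shown"}, 5, 50, 5}]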

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or, second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stop-word" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

"experience" "in" "the" 919, "transferred" "to" "the" 918, "funded" "by" "the" 918, "of" "the" "provincial" 915, "this" "web" "site" 915, "409" "registry" "2" 914, "are" "no" "longer" 914, "gouvernement" "du" "canada" 910, "use" "this" "link" 887, "page" "or" "you" 887, "following" "page" "or" 887, "automatically" "transferred" "to" 887, "be" "automatically" "transferred" 887, "will" "be" "automatically" 887, "industry" "you" "will" 887, "multimedia" "industry" "you" 887, "practical" "experience" "in" 887, "gained" "practical" "experience" 887, "they" "gained" "practical" 887, "while" "they" "gained" 887, "canadians" "while" "they" 887, "young" "canadians" "while" 887, "showcase" "of" "work" 887, "excellent" "showcase" "of" 887 …

Here we see an array of common language: funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we move into the unique or very rare phrases that appear only once, twice, or three times.

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a computer science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable, and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.
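For the command-line route, the standard MALLET sequence looks roughly like the following sketch (the directory, file names, and topic count are illustrative; the commands and flags are MALLET's own import and training steps):

bin/mallet import-dir --input cdc/file267 --output file267.mallet \
    --keep-sequence --remove-stopwords
bin/mallet train-topics --input file267.mallet --num-topics 20 \
    --output-topic-keys topic-keys.txt --output-doc-topics doc-topics.txt

The topic-keys file lists the most prominent words per topic, which is what feeds the word cloud view described above.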

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes); the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language); the stories (story, knight, archives, but also, interestingly, knight and wife); and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works. It is a proprietary black box: while the original algorithm is available, the actual methods that go into determining the ranking of search results are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above, we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
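A sketch of the search step in Mathematica, assuming (hypothetically) one exported word-frequency file per site; the directory layout and helper name are mine for illustration, not the original tool's:

files = FileNames["cdc-frequencies/*.txt"];
(* count the lines of a site's frequency file that mention a term *)
mentions[file_String, term_String] :=
  Length[Select[Import[file, "Lines"], StringMatchQ[#, ___ ~~ term ~~ ___] &]];
counts = Map[mentions[#, "freedom"] &, files];
BarChart[counts, ChartLabels -> Placed[Map[FileBaseName, files], Tooltip]]

The tooltip labels give the mouse-over behaviour described above: hovering over a bar reveals which site it represents.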

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century and a particular region, as well as to subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly locate critical sources.

Another model through which we can visualize documents is a viewer primarily focusing on term frequency and inverse document frequency, terms discussed below. To prepare this material, we combine our earlier data about word frequency (that the word "experience" appears 919 times in a site, perhaps) with aggregate information about frequency in the entire corpus (i.e., that the word "experience" appears 1,766 times overall). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, through which we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf multiplied by the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.
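As a worked sketch of that formula (not the viewer itself): with N documents in the corpus and df(t) the number of documents containing term t, each term in a document d is weighted tf(t,d) × log(N/df(t)). In Mathematica, assuming docs is a list of word lists, one per site:

(* tf-idf weight for every distinct word of one document *)
tfidf[doc_List, corpus_List] := Module[{ndocs = Length[corpus]},
  Map[{#[[1]], N[#[[2]]*Log[ndocs/Count[corpus, d_ /; MemberQ[d, #[[1]]]]]]} &,
   Tally[doc]]];
distinctive = Reverse[SortBy[tfidf[First[docs], docs], Last]];

Words that appear in every document receive a weight of zero (log of one), which is exactly why boilerplate drops out and distinctive vocabulary rises to the top.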

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5 MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance rather than processed on the spot, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize the usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need arises, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin, or jumpstart, a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. Such courses should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and of attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include familiarity with citation management software and databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital repositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of the various ways to openly disseminate material on the web, whether through blogging, micro-blogging, publishing webpages, and so forth; it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history, and the readiness to work with born-digital sources, cannot stem from pedagogy alone, however. Institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary and, second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996: Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates/works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline/44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.


14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients - 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to teach the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp Greater Toronto Area, Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age (Google eBook: Penguin, 2010). For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online: www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can, of course, vary depending on the compression methods chosen. Their math was based on an average of 8 MB per book, if scanned at 600 dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82, published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC: May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Centre for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Centre for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: Temple University Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in ProQuest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online: www.waybackarchive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online: www.web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collections Webpage, 24 November 2000, through Internet Archive, www.web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections homepage," 21 November 2003, through Internet Archive, www.web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important news about Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive, www.web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 pm on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and now "Emma" is the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.


48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 "Important Notices," Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. These can usually be found on index or splash pages, under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie file, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated, late-2007 MacBook Pro.

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57. “SHARCNET: History of SHARCNET,” SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58. “Priority Areas,” SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59. Eloquently discussed in Adam Crymble, “Is Digital Humanities a Field of Research?” Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.



thumb for when a topic becomes “history,” it is worth noting as an example that it took less than 30 years after the tumultuous year of 1968 for a varied, developed, and contentious North American historiography to appear on the topic of life in the 1960s.1 In 2021, it will have been a similar 30 years since, in August 1991, Tim Berners-Lee, a fellow at the European Organization for Nuclear Research (CERN), published the very first website and launched the World Wide Web. Professional historians need to be ready to add and develop new skills to deal with sources born in a digital age.

Digital sources necessitate a rethinking of the historian’s toolkit. Basic training for new historians requires familiarity with new methodological tools, and resources for acquiring those tools must also be made available to their mentors and teachers. Not only will these tools help in dealing with more recent born-digital sources, but they will also help historians capitalize on digitized archival sources from our more-distant past. This form of history will not replace earlier methodologies, but will instead play a meaningful, collaborative, and supportive role. It will be a form of distant reading, encompassing thousands or tens of thousands of sources, that will complement the more traditional and critically important close reading of small batches of sources that characterizes so much of our work. Indeed, much of what we think of as “digital history” may simply be “history” in the years to come.2

There are a number of avenues along which this transformation will take place. First, it is important to establish a map that situates this emerging field in what has come before. Second, I examine the current situation of born-digital sources and historical practice. Third, I use an early born-digital archive from Library and Archives Canada (LAC), Canada’s Digital Collections project (CDC), to evaluate how we access and make sense of digital information today. I then apply emergent visualization techniques to the sources in this archive to show one way through which historians can tackle born-digital source issues. Lastly, I lay out a road map to ensure that the next generation of historians has enough basic digital literacy to adequately analyze this new form of source material.

Imagine a future historian taking on a central question of social and cultural life in the middle of the first decade of the twenty-first century, such as how Canadians understood Idle No More or the Occupy movement through social media. What would her archive look like? I like to imagine it as boxes stretching out into the distance, tapering off without any immediate reference points to bring it into perspective. Many of these digital archives will be without human-generated finding aids, would have perhaps no dividers or headings within “files,” and certainly no archivist with a comprehensive grasp of the collection contents. A recent example is illustrative. During the #IdleNoMore protests, Twitter witnessed an astounding 55,334 tweets on 11 January 2013. Each tweet can be up to 140 characters; to put that into perspective, this article is less than 900 lines long in a word processor. Yet this massive amount of information looks cryptic: strange usernames, no archives, no folders, not even an immediate way to see whether a given tweet is relevant to your research or not.3 This is a vast collection. Information overload.

This example is indicative of the issues facing the historical profession.4 I am not exaggerating. Every day, half a billion tweets are sent,5 hundreds of thousands of comments are uploaded to news sites, and scores of blog posts are written. This makes up one small aspect of broader data sets: automated logs, including climate readings, security access reports, library activity, search patterns, and books and movies ordered from Amazon.ca or Netflix.6 Even if only a small fraction of this is available to future historians, it will represent an unparalleled source of information about how life was lived (or thought to have been lived) during the early-twenty-first century. To provide another, perhaps more tangible, example: while carrying out my previous work on youth cultures in 1960s English Canada, I was struck by how inaccessible television sources were compared to newspaper records, despite television being arguably the primary medium of the time. Unlike the 1960s, however, the Internet leaves a plethora of available – if difficult to access – primary sources for historians. In the digital age, archiving will be different, which means that the history we write will be different.

We must begin planning for this plethora of born-digital sources now. Using modern technology, we have the ability to quickly find needles in digital haystacks: to look inside a ‘box’ and immediately find the relevant letter, rather than spending hours searching and reading everything. Yet we also have the ability to pull our gaze back, distantly elucidating the context of our relevant documents as well.

The Necessity of Distant Reading: Historians and Computational History

Franco Moretti, in his 2005 groundbreaking work of literary criticism, Graphs, Maps, Trees, called on his colleagues to rethink their craft. With the advent of mass-produced novels in nineteenth-century Britain, Moretti argued, scholars had to change their disciplinary approach. Traditionally, literary scholars worked with a corpus of around 200 novels; though an impressive number for the average person, it represents only around one percent of the total output of that period. To work with all of this material in an effort to understand these novels as a genre, critics needed to approach research questions in a different way:

[C]lose reading won’t help here, a novel a day every day of the year would take a century or so … And it’s not even a matter of time, but of method: a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn’t a sum of individual cases: it’s a collective system, that should be grasped as such, as a whole.7

Instead of close reading, Moretti called for literary theorists to practice “distant reading.”

For inspiration, Moretti drew on the work of French Annales historian Fernand Braudel. Braudel, amongst the most instrumental historians of the twentieth century, pioneered a similarly distant approach to history: his inquiries spanned large expanses of time and space, that of civilizations, the Mediterranean world, as well as smaller (only by comparison) regional histories of Italy and France. His distant approach to history did not necessitate disengagement with the human actors on the ground, seeing instead that beyond the narrow focus of individual events lay the constant ebb and flow of poverty and other endemic structural features.8 While his methodology does not apply itself well to rapidly changing societies (his focus was on the long-term, slow change), his points around distant reading and the need for collaborative, interdisciplinary research are a useful antecedent. Similar opportunities present themselves when dealing with the vast array of born-digital sources (as well as digitized historical sources): we can distantly read large arrays of sources, from a longue durée of time and space, mediated through algorithms and computer interfaces, opening up the ability to consider the voices of everybody preserved in the source.

Historians must be open to the digital turn, due to the astounding growth of digital sources and an increasing technical ability to process them on a mass scale. Social sciences and humanities researchers have begun to witness a profound transformation in how they conduct and disseminate their work. Indeed, historical digital sources have reached a scale where they defy conventional analysis and now require computational analysis. Responding to this, archives are increasingly committed to preserving cultural heritage materials in digital forms.

I would like to pre-emptively respond to two criticisms of this digital shift within the historical profession. First, historians will not all have to become programmers. Just as not all historians need a firm grasp of Geographical Information Systems (GIS), a developed understanding of the methodological implications of community-based oral history, the incorporation of a transnational perspective, or in-depth engagement with cutting-edge demographic models, not all historians have to approach their trade from a computational perspective. Nor should they. Digital history does not replace close reading, traditional archival inquiry, or going into communities – to use only a few examples – to uncover notions of collective memory or trauma. Indeed, digital historians will play a facilitative role and provide a broader reading context. Yet they will still be historians: collecting relevant sources, analyzing and contextualizing them, situating them in convincing narratives or explanatory frameworks, and disseminating their findings to wider audiences.

Neither will historians replace professional programmers. Instead, as with other subfields, historians using digital sources will need to be prepared to work on larger, truly interdisciplinary teams. Computer scientists, statisticians, and mathematicians, for example, have critical methodological and analytical contributions to make towards questions of historical inquiry. Historians need to forge further links with these disciplines. Such meaningful collaboration, however, will require historians who can speak their languages and who have an understanding of their work. Some of us must be able to discuss computational or algorithmic methodologies when we collectively explore the past.

The methodologies discussed in this paper fall into the twin fields of data mining and textual analysis. The former refers to the discovery of patterns or information about large amounts of information, while the latter refers to the use of digital tools to analyze large quantities of text.9 The case study here draws upon three main concepts: information retrieval (quickly finding what you want), information and term frequency (how often various concepts and terms appear), and topic modeling (finding frequently occurring bags of words throughout documents). Other projects have involved authorship attribution (finding out the author of documents based on previous patterns), style analysis, or creating network graphs of topics, groups, documents, or individuals.

The Culturomics project, run by the Cultural Observatory at Harvard University, is a notable model of collaboration on a big history project. As laid out in Science and put online as the Google Books n-gram viewer (www.books.google.com/ngrams), this project indexed word and phrase frequency across over five million books, enabling researchers to trace the rise and fall of cultural ideas and phenomena through targeted keyword and phrase searches.10 This is the most ambitious, and certainly the most advanced and accessible, Big History project yet launched. Yet amongst the extensive author list (13 individuals and the Google Books team) there were no historians. This suggests that other disciplines in the humanities and social sciences have embraced the digital turn far more than historians. However, as I mentioned, in the coming decades historians will increasingly rely on digital archives. There is a risk that history, as a professional discipline, could be left behind.

American Historical Association President Anthony Grafton noted this absence in a widely circulated Perspectives article. Why was there no historian, he asked, when this was both primarily a historical project and a team comprised of varied doctoral holders from disciplines such as English literature, psychology, computer science, biology, and mathematics?12 Culturomics project leaders Erez Lieberman Aiden and Jean-Baptiste Michel responded frankly. While the team approached historians, and some did play advisory roles, every author listed “directly contributed to either the creation of the collection of written texts (the ‘corpus’) or to the design and execution of the specific analyses we performed. No academic historians met this bar.” They continued to indict the profession:

The historians who came to the meeting were intelligent, kind, and encouraging. But they didn’t seem to have a good sense of how to wield quantitative data to answer questions, didn’t have relevant computational skills, and didn’t seem to have the time to dedicate to a big multi-author collaboration. It’s not their fault: these things don’t appear to be taught or encouraged in history departments right now.13

This is a serious indictment considering the pending archival shift to digital sources, and one that still largely holds true (a few history departments now offer digital history courses, but they are still outliers). This form of history, the massive analysis of digitized information, will continue. Furthermore, these approaches will attract public attention, as they lend themselves well to accessible online deployment. Unless historians are well positioned, they risk ceding professional ground to other disciplines.

While historians were not involved in Culturomics, they are, however, involved in several other digital projects. Two major and ambitious long-term Canadian projects are currently using computational methods to reconstruct the demographic past. The first is the Canadian Century Research Infrastructure project, which is creating databases from five censuses and aiming to provide “a new foundation for the study of social, economic, cultural, and political change,” while the second is the Programme de recherche en démographie historique, hosted by the Université de Montréal.13 They are reconstructing the European population of Quebec in the seventeenth and eighteenth centuries, drawing heavily on parish registers. Furthermore, several projects in England have harnessed the wave of interest in “big data” to tell compelling historical stories about their pasts: the Criminal Intent project, for example, used the proceedings of the Old Bailey courts (the main criminal court in Greater London) to visualize large quantities of historical information.14 The Criminal Intent project was itself funded by the Digging into Data Challenge, which brings together multinational research teams and funding from federal-level funding agencies in Canada, the United States, the United Kingdom, and beyond. Several other historical studies have been funded under its auspices, from studies of commodity trade in the eighteenth- and nineteenth-century Atlantic world (involving scholars at the University of Edinburgh, the University of St Andrews, and York University) to data mining the degrees of economic opportunity and spatial mobility in Britain, Canada, and the United States (bringing together researchers at the University of Guelph, the University of Leicester, and the University of Minnesota).15 These exciting interdisciplinary teams may be heralding a disciplinary shift, although it is still early days.

Collaborative teams have limitations, though. We will have to be prepared to tackle digital sources alone, on shoestring budgets. Historians will not always be able to call on external programmers or specialists to do this work, because teams often develop out of funding grants. As governmental and university budgets often prove unable to meet consistent research needs, baseline technical knowledge is required for independent and unfunded work. If historians rely exclusively on outside help to deal with born-digital sources, the ability to do computational history diminishes. As the success of the Programming Historian indicates, and as I will demonstrate in my case study below, historians can learn basic techniques.16 Let us tweak our own algorithms.

A social or cultural history of the very late-twentieth or twenty-first century will have to account for born-digital sources, and historians will (in conjunction with archival professionals) have to curate and make sense of the data. Historians need to be able to engage with these sources themselves, or at least as part of a team, rather than having sources mediated through the lens of outside disciplines or commercial interests. Algorithms may be one meaningful way to make sense of cultural trends or extract meaningful information from billions of tweets, but we must ensure professional historical input goes into their crafting.

The Third Wave of Computational History

Historians have fruitfully used computers in the past, as part of two previous waves. The 1960s and 1970s saw computational tools used for demographic, population, and economic histories, a trend that in Canada saw full realization in comprehensive studies such as Michael Katz’s The People of Hamilton, Canada West.17 As Ian Anderson has convincingly noted in a review of history and computing, studies of this sort saw the field become associated with quantitative studies.18

After a retreat from the spotlight, computational history rose again in the 1990s with the advent of the personal computer, graphical user interfaces, and overall improvements in ease of use. Punch cards could now give way to database programs, GIS, and early online networks such as H-Net and Usenet. We are now, I believe, on the cusp of a third revolution in computational history, thanks to three main factors: decreasing storage costs, the power of the Internet and distributed cloud computing, and the rise of professionals dealing with both digital preservation and open-source tools.

Decreasing storage costs have led to massive amounts of information being preserved, as Gleick noted in this article’s epigraph. It is important to note that this information overload is not new: people have long worried about the impact of too much information.19 In the sixteenth century, responding to the rise of the printing press, the German priest Martin Luther decried that the “multitude of books [were] a great evil.” In the nineteenth century, Edgar Allan Poe bemoaned that “[t]he enormous multiplication of books in every branch of knowledge is one of the greatest evils of this age.” As recently as 1970, American historian Lewis Mumford lamented that “the overproduction of books will bring about a state of intellectual enervation and depletion hardly to be distinguished from massive ignorance.”20 The rise of born-digital sources must thus be seen in this continuous context of handwringing around the expansion and rise of information.

Useful historical information is being preserved at a rate that is accelerating with every passing day. One useful metric by which to compare dataset size is that of the American Library of Congress (LOC); indeed, a “LOC” has become shorthand for a unit of measurement in the digital humanities. Measuring the actual extent of the printed collection in terms of data is notoriously difficult: James Gleick has claimed it to be around 10TB,22 whereas the Library of Congress itself claims a more accurate figure is about 200TB.23 While these two figures are obviously divergent, neither is on an unimaginable scale – personal computers now often ship with a terabyte or more of storage. When Claude Shannon, the father of information theory, carried out a thought experiment in 1949 about items or places that might store “information” (an emerging field which viewed messages in their aggregate), he scaled sites from a single digit upwards to the largest collection he could imagine. The list scaled up logarithmically, from 10^0 to 10^14: if a single-spaced page of typing is 10^4, the Proceedings of the Institute of Radio Engineers 10^7, and the entire text of the Encyclopaedia Britannica 10^9, then 10^14 was “the largest information stockpile he could think of: the Library of Congress.”24

But now the Library of Congress’ print collection – something that has taken two centuries to gather – is dwarfed by its newest collection of archived born-digital sources, the vast majority only a matter of years old, compared to the much wider date-range of non-digital sources in the traditional collection. The LOC has begun collecting its own born-digital Internet archive: multiple snapshots are taken of webpages to show how they change over time. Even as a selective, curated collection, drawing on governmental, political, educational, and creative websites, the LOC has already collected 254TB of data, and it adds 5TB a month.25 The Internet Archive, through its Wayback Machine, is even more ambitious: it seeks to archive every website on the Internet. While its size is also hard to put an exact finger on, as of late 2012 it had over 10 petabytes of information (or, to put it into perspective, a little over 50 Library of Congresses, if we take the 200TB figure).26

Conventional sources are also part of this deluge. Google Books has been digitizing the printed word, and it currently has 15 million works completed; by 2020, it audaciously aims to digitize every book ever written. While copyright issues loom, the project will allow researchers to access aggregate word and phrase data, as seen in the Google n-gram project.27 In Canada, Library and Archives Canada (LAC) has a more circumscribed digitization project, collecting “a representative sample of Canadian websites,” focusing particularly on Government of Canada webpages, as well as a curated approach preserving case studies such as the 2006 federal election. Currently, it has 4TB of federal government information and 7TB of total information in its larger web archive (a figure which, as I note below, is unlikely to drastically improve despite ostensible goals of archival ‘modernization’).28

In Canada, LAC has – on the surface – been advancing an ambitious digitization program, as it realizes the challenges now facing archivists and historians. Daniel J. Caron, the Librarian and Archivist of Canada, has been outspoken on this front, with several public addresses on the question of digital sources and archives. Throughout 2012, primary source digitization at LAC preoccupied critics who saw it as a means to depreciate on-site access. The Canadian Association of University Teachers and the Canadian Historical Association, as well as other professional organizations, mounted campaigns against this shift in resources.29 LAC does have a point, however, with its modernization agenda. Canada’s online collections are small and sorely lacking when compared to the on-site collections, and LAC does need to modernize and achieve the laudable goal of reaching audiences beyond Ottawa.

Nonetheless, digitization has unfortunately been used as a smokescreen to depreciate overall service offerings.30 The modernization agenda would also see 50 percent of the digitization staff cut.31 LAC’s online collections are small; they do not work with online developers through Application Programming Interfaces (APIs, a way for a computer program to talk to another computer and speed up research – Canadiana.ca is incidentally a leader in this area); and there have been no promising indications that this state of affairs will change. Digitization has not been comprehensive. Online preservation is important, and historians must fight for both on-site and on-line access.

A number of historians have recognized this challenge and are playing an instrumental role in preserving the born-digital past. The Roy Rosenzweig Center for History and New Media at George Mason University launched several high-profile preservation projects: the September 11th Digital Archive, the Hurricane Digital Memory Bank (focusing on Hurricanes Katrina and Rita), and, as of writing, the Occupy archive.32 Massive amounts of online content are curated and preserved: photographs, news reports, blog posts, and now tweets. These complement more traditional efforts of collecting and preserving oral histories and personal recollections, which are then geo-tagged (tied to a particular geographical location), transcribed, and placed online. These archives serve a dual purpose: preserving the past for future generations while also facilitating easy dissemination for today’s audiences.

Digital preservation, however, is a topic of critical importance, and much of our early digital heritage has been lost. Many websites, especially those predating the Internet Archive’s 1996 web archiving project launch, are completely gone. Early storage media (from tape to early floppy disks) have become obsolete and are now nearly inaccessible. Compounding this, software development has seen early proprietary file formats depreciated, ignored, and eventually made wholly inaccessible.33 Fortunately, the problem of digital preservation was recognized by the 1990s, and in 2000 the American government established the National Digital Information Infrastructure and Preservation Program.34 As early as 1998, digital preservationists argued: “[h]istorians will look back on this era … and see a period of very little information. A ‘digital gap’ will span from the beginning of the wide-spread use of the computer until the time we eventually solve this problem. What we’re all trying to do is to shorten that gap.” To bear home their successes, that quotation is taken from an Internet Archive snapshot of Wired magazine’s website.35

Indeed, greater attention towards rigorous metadata (which is, briefly put, data about data, often preserved in plain-text format), open-access file formats documented by the International Organization for Standardization (ISO), and dedicated digital preservation librarians and archivists gives us hope for the future.36 It is my belief that cloud storage, the online hosting of data in large third-party offsite centres, mitigates the issue of obsolete storage media, and that the move towards open-source file formats, even amongst large commercial enterprises such as Microsoft, provides greater documentation and uniformity.


A Brief Case Study: Accessing and Navigating Canada’s Digital Collections, With an Emphasis on Tools for Born-Digital Sources

What can we do about this digital deluge? There are no simple answers, but historians must begin to conceptualize new additions to their traditional research and pedagogical toolkits. In the section that follows, I will do two things: introduce a case study, and introduce several tools – both proprietary and those that are open and available off-the-shelf – to reveal one way forward.

For a case study, I have elected to use Canada’s Digital Collections (CDC), a collection of ‘dead’ websites removed from their original location on the World Wide Web and now digitally archived at LAC. In this section, I will introduce the collection, using digitally-archived sources from the early World Wide Web, and discuss my research methodology, before illustrating how we can derive meaningful data from massive arrays of unstructured data.37

This is a case study of a web archive. Several of these methodologies could be applied to other formats, including large arrays of textual material, whether born-digital or not: a distant reading of this type is not medium specific. It does, however, assume that the text is high quality. Digitized primary documents do not always have text layers, and if they are instituted computationally through Optical Character Recognition, errors can appear. Accuracy rates for digitized textual documents can range from as low as 40 percent for seventeenth- and eighteenth-century documents, to 80–90 percent for modern microfilmed newspapers, to over 99 percent for typeset, word-processed documents.38 That said, while historians today may be more interested in applying distant reading methodologies to more traditional primary source materials, it is my belief that in the not-so-distant future websites will be an increasingly significant source for social, cultural, and political historians.

What are Canada’s Digital Collections? In 1996, Industry Canada, funded by the federal Youth Employment Strategy, facilitated the development of Canadian heritage websites by youth (defined as anyone between the ages of 15 and 30). It issued contracts to private corporations that agreed to use these individuals as web developers, for the twofold purpose of developing digital skills and instilling respect for, and engagement with, Canadian heritage and preservation.39 The collection, initially called SchoolNet but rebranded as Canada’s Digital Collections in 1999, was accessible from the CDC homepage. After 2004, the program was shut down and the vast majority of the websites were archived on LAC’s webpage. This collection is now the largest born-digital Internet archive that LAC holds, worthy of study for its content as well as for thinking about methodological approaches.

Conventional records about the program are unavailable in the traditional LAC collection, unsurprising in light of both its recent creation and backlogs in processing and accessioning archival materials. We can use one element of our born-digital toolkit, the Internet Archive’s WaybackMachine (its archive of preserved early webpages), to get a sense of what this project looked like in its heyday. The main webpage for the CDC was www.collections.ic.gc.ca. However, a visitor today will see nothing but a collection of broken image links (two of them will bring you to the English and French-language versions of the LAC-hosted web archive, although this is only discernible through trial and error or through the site’s source code).

With the WaybackMachine, we have a powerful, albeit limited, opportunity to “go back in time” and experience the website as previous web viewers would have (if a preserved website is available, of course). Among the over 420 million archived websites, we can search for and find the original CDC site. Between 12 October 1999 and the most recent snapshot, or web crawl, of the now-broken site on 6 July 2011, the site was crawled 410 times.40 Unfortunately, every site version until the 15 October 2000 crawl has been somehow corrupted: content has instead been replaced by Chinese characters, which translate only into place names and other random words. The issues of early digital preservation noted above have reared their head in this collection.

From late 2000 onward, however, fully preserved CDC sites are accessible. Featured sites, new entries, success stories, awards, curricular units, and prominent invitations to apply for project funding populate the brightly coloured website. The excitement is palpable. John Manley, then Industry Minister, declared that “it is crucial that Canada’s young people get the skills and experience they need to participate in the knowledge-based economy of the future,” and that CDC “shows that young people facing significant employment challenges can contribute to and learn from the resources available on the Information Highway.”41 The documentation, reflecting the lower level of technical expertise and the scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as well as preservation issues, hardware requirements for computers (a computer with an Intel 386 processor, 8MB of RAM, a 1GB hard drive, and a 28.8 kbps modem was recommended), and training information for employers, educators, and participants.42 Sample proposals and community liaison ideas, amidst other information, set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites offers an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how viewers discovered the site and whether they think it is a good educational resource (the administrators must have been relieved that the majority thought so – in 2003, some 84 percent felt that it was a “good” resource, the best of its type).43 By 2004, the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, alphabetical listing, success stories, and detailed copyright information.

The Canada’s Digital Collections site began to move towards removal in November 2006, and initially the content was completely gone. A visitor to the site would learn that the content “is no longer available,” and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007, a notice appeared with a declaration that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained, then and today for that matter, is a straightforward alphabetical list and a disclaimer that noted: “[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost.”

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use to navigate large amounts of data. First, I create my own tools through computer programming; this work is showcased in the following section. I program in Wolfram Research’s Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base, which complements exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website in perfectly readable plaintext (avoiding Unicode errors), as follows: input = Import["http://activehistory.ca/", "Plaintext"], and a second can make it all lower case for textual analysis, as such: lowerpage = ToLowerCase[input]. Variables are a critical building block: in the former example, “input” now holds the entire plaintext of the website ActiveHistory.ca, and “lowerpage” then holds it in lower case. Once we have lower-cased information, we are quickly able to extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
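To give a flavour of how little code is involved, here is a minimal sketch of that word-frequency step (the functions are standard Wolfram Language; the cleanup and thresholds of a full workflow are omitted):

    input = Import["http://activehistory.ca/", "Plaintext"];
    lowerpage = ToLowerCase[input];
    words = DeleteCases[
       StringSplit[lowerpage, Except[WordCharacter] ..], ""]; (* break the text into words *)
    ranked = SortBy[Tally[words], -Last[#] &];                (* count and rank by frequency *)
    Take[ranked, 20]                                          (* the twenty most frequent words *)

The same five lines scale from a single webpage to a directory of thousands of archived files, which is the sense in which such scripts become finding aids.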

Even more promising is Mathematica’s integration with the Wolfram|Alpha database (itself accessible at wolframalpha.com), a computational knowledge engine that promises to put incredible amounts of information at a user’s fingertips. With a quick command, one can access the temperature at Toronto’s Lester B. Pearson International Airport at 5:00 p.m. on 18 December 1983, the relative popularity of names such as “Edith” or “Emma,” the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information or find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be-all and end-all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools, a powerful suite of textual analysis tools.47 Uploading one’s own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a “word cloud” of the corpus; the ability to click on any word to see its relative rise and fall over time (word trends); access to the original text; and the ability to click on any word to see its contextual placement within individual sentences. It ably facilitates the “distant” reading of any reasonably-sized body of textual information and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. Many visualization options are provided: you can plot relationships between data, compare values (i.e., bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as to IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found within the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or – in a best-case scenario – hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into “topics” that appear together, giving a quick sense of a document’s structure.49 Other workflows found within SEASR include Dunning’s log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find “people,” “places,” and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2 or on my own blog at ianmilligan.ca. I have elected to leave those instructions online rather than provide full commands in this article: they require some technical knowledge, and working directly from the web is quicker than from the printed page. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today’s historian, dealing with conventional sources, often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada’s Digital Collections, a single command-line entry brought the entire archive onto my system; it took days to process the information, and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay, or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations. The information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, with 360 MB of plaintext with HTML encoding, reduced to a ‘mere’ 50MB if all encoding is removed. A 50MB plain-text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the Internet itself, or we can download it onto our own systems. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with the material. Before you do this, though, you need to be familiar with the website’s “terms of service,” the legal documentation that outlines how users are allowed to use a website, its resources, or its software. In this case, LAC provides generous non-commercial reproduction terms: information “may be reproduced in part or in whole and by any means, without charge or further permission, unless otherwise specified.”51

The CDC web archive index is organized like so: the index is located at www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/ABbenevolat, the next in …cdc/abnature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system, in this case one noting that you had seen the disclaimer about the archival nature of the website), and then, with a one-line command in my command line, downloaded the entire database.52 Due to the imposition of a random wait and a bandwidth limit, which meant that I would download one file every half to one and a half seconds, at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these connections must often be inferred online. HTML born-digital sources are different, in that the link offers a very concrete file.

Visualizing the CDC Collection as a Series of Hyperlinks

One quick visualization, the circular cluster seen above, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As the figure shows, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The crawl is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and the relationships between different sources. At a glance, we can see suggestions to follow other sites, and determine whether any nodes are shared amongst the various websites revealed.
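The scraping step behind that figure is short; a sketch ("Hyperlinks" is a built-in import element, while the two-link crawl depth and interactive mouseover of the full model are elided here):

    site = "http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp";
    links = Import[site, "Hyperlinks"];   (* scrape every outbound link on the page *)
    edges = Map[(site -> #) &, links];    (* one directed edge per hyperlink *)
    Graph[edges]                          (* render the link cluster *)

Repeating the Import over each scraped link, and feeding all the resulting edges to Graph, produces the clustered view above.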

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada’s Digital Collections corpus? As a historian who is primarily interested in the written word, my first step was to isolate the plaintext throughout the material, so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels – first, the aggregate collection, and second, the collection divided into each ‘file,’ or individual website – deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes, and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words within the nearly eight million words of the collection appear? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word “Canada” appears 26,429 times, “collections” 11,436, and “[L]ouisbourg” an astounding 9,365 times. Reading such a list can quickly wear out a reader, and it is of minimal visual attractiveness. Two quick decisions were made to enhance data usability. First, the information was “case-folded,” all sent to lower case; this enables us to count the total appearances of a word such as “digital” regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency – as opposed to phrase frequency – “stop words” should be removed, as they were here. These are “extremely common words which would appear to be of little value in helping select documents matching a user need”: words such as “the,” “and,” “to,” and so forth. They matter for phrases (the example “to be or not to be” is comprised entirely of stop words), but will quickly dominate any word-level visualization or list.53 Other words may need to be added to, or removed from, stop lists depending on the corpus: for example, if the phrase “page number” appeared on every page, that would be useless information for the researcher.
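Continuing the earlier sketch, the filtering itself is one line against a hand-built list (the eight words below stand in for the few hundred of a real stop list):

    stopwords = {"the", "and", "to", "of", "a", "in", "is", "it"}; (* illustrative subset *)
    contentwords = Select[words, ! MemberQ[stopwords, #] &];       (* keep only content words *)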

The first step, then, is to use Mathematica to format word frequency information into a Wordle.net-readable format (if we put all of the text into that website, it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds.
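One hedged sketch of that formatting step, assuming the ranked list from above and Wordle’s advanced “word:weight” input convention, with the list truncated so as not to overwhelm the site:

    (* turn {word, count} pairs into "word:count" lines for Wordle's advanced page *)
    wordlelines = Map[First[#] <> ":" <> ToString[Last[#]] &, Take[ranked, 150]];
    Export["cdc-wordle.txt", StringJoin[Riffle[wordlelines, "\n"]], "Text"];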


A Word Cloud of All Words in the CDC Corpus

What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that the collection deals with Canada and Canadians, that it contains documents in French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that “ontario” is less frequent than “newfoundland,” “nova scotia,” and “alberta,” and that “québec” and “quebec” are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, a word cloud leaves a severe amount of ambiguity: are “ontario” and “québec” equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word “school” refer to built structures or to educational experiences? The same question applies to “church.” Do mentions of “newfoundland” refer to historical experiences there, or to expatriates, or to travel narratives? Word clouds are useful first steps, and have their place, but must be used cautiously. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. That figure is a visualization of the website “past and present”: despite the ambiguous title, we can quickly learn what the website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams – bigrams, trigrams, quadgrams, and fivegrams – although one could theoretically do any higher number. A bigram can be a combination of two characters (such as ‘bi’), two syllables, or – pertinent in our case – two words; an example: “canada is.” A trigram is three words, a quadgram four words, a fivegram five words, and so on.
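Generating these is mechanical; a sketch for bigrams, again assuming the word list built earlier (stop words retained this time, since they matter for phrases):

    bigrams = Partition[words, 2, 1];                    (* every overlapping two-word window *)
    rankedbigrams = SortBy[Tally[bigrams], -Last[#] &];  (* count and rank the pairs *)
    Take[rankedbigrams, 10]                              (* the ten most frequent bigrams *)

Swapping the 2 for 3, 4, or 5 yields the trigrams, quadgrams, and fivegrams discussed below.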


Word Cloud of the File “Past and Present”

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach’s utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named “past and present” website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

{"alberta", 758}, {"listen", 490}, {"settlers", 467}, {"new", 454}, {"canada", 388}, {"land", 362}, {"read", 328}, {"people", 295}, {"settlement", 276}, {"chinese", 268}, {"canadian", 264}, {"place", 255}, {"heritage", 254}, {"edmonton", 250}, {"years", 238}, {"west", 236}, {"raymond", 218}, {"ukrainian", 216}, {"came", 213}, {"calgary", 209}, {"names", 206}, {"ranch", 204}, {"updated", 193}, {"history", 190}, {"communities", 183} …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example, in this automatically generated output of top phrases:

{{"first", "people", "settlers"}, 100}, {{"the", "united", "states"}, 72}, {{"one", "of", "the"}, 59}, {{"of", "the", "west"}, 58}, {{"settlers", "last", "updated"}, 56}, {{"people", "settlers", "last"}, 56}, {{"albertans", "first", "people"}, 56}, {{"adventurous", "albertans", "first"}, 56}, {{"the", "bar", "u"}, 55}, {{"opening", "of", "the"}, 54}, {{"communities", "adventurous", "albertans"}, 54}, {{"new", "communities", "adventurous"}, 54}, {{"bar", "u", "ranch"}, 53}, {{"listen", "to", "the"}, 52} …

Through a few clicks, we can navigate and determine the basic contours – from a distance – of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if “aboriginal” appears 457 times, then n=457); or second, relative appearances of a given term (if “aboriginal” appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the ‘chorus’ effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common “stop-word” equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

{{"experience", "in", "the"}, 919}, {{"transferred", "to", "the"}, 918}, {{"funded", "by", "the"}, 918}, {{"of", "the", "provincial"}, 915}, {{"this", "web", "site"}, 915}, {{"409", "registry", "2"}, 914}, {{"are", "no", "longer"}, 914}, {{"gouvernement", "du", "canada"}, 910}, {{"use", "this", "link"}, 887}, {{"page", "or", "you"}, 887}, {{"following", "page", "or"}, 887}, {{"automatically", "transferred", "to"}, 887}, {{"be", "automatically", "transferred"}, 887}, {{"will", "be", "automatically"}, 887}, {{"industry", "you", "will"}, 887}, {{"multimedia", "industry", "you"}, 887}, {{"practical", "experience", "in"}, 887}, {{"gained", "practical", "experience"}, 887}, {{"they", "gained", "practical"}, 887}, {{"while", "they", "gained"}, 887}, {{"canadians", "while", "they"}, 887}, {{"young", "canadians", "while"}, 887}, {{"showcase", "of", "work"}, 887}, {{"excellent", "showcase", "of"}, 887} …

Here we see an array of common language: funding information, link information, project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we can move into the unique or very rare phrases that appear only once, twice, or three times.
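To make the two counting approaches concrete, here is a minimal sketch of them, written in Python rather than the Mathematica used for the actual analysis; the folder of one plain-text file per archived site is a hypothetical layout, not the collection's actual format:

from collections import Counter
from pathlib import Path

def trigrams(text):
    words = text.lower().split()
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

# One plain-text file per archived website ("cdc_texts" is an assumed layout).
site_texts = {p.name: p.read_text(errors="ignore")
              for p in Path("cdc_texts").glob("*.txt")}

# Approach one: aggregate appearances of each trigram across the collection.
aggregate = Counter()
for text in site_texts.values():
    aggregate.update(trigrams(text))

# Approach two: relative appearances, the share of sites containing a trigram,
# which controls for one site repeating the same phrase (the "chorus" effect).
def relative(term):
    hits = sum(1 for text in site_texts.values() if term in set(trigrams(text)))
    return hits / len(site_texts)

print(aggregate.most_common(15))
print(relative("opening of the"))

Either metric can be swapped in, depending on whether repetition within a single site should count: precisely the trade-off weighed above.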

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time. … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.

At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.
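The modeling itself ran through MALLET via the SEASR workbench, as described above. For readers who want a sense of what such a run looks like in code, here is a rough sketch using the open-source gensim library in Python; this is a stand-in for illustration only, not the article's actual toolchain, and the folder of page-level text files is assumed:

from pathlib import Path
from gensim import corpora, models

# One token list per page of the archived site (folder name is hypothetical).
docs = [p.read_text(errors="ignore").lower().split()
        for p in Path("site_267_pages").glob("*.txt")]

dictionary = corpora.Dictionary(docs)           # map words to integer ids
corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words per page

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10, passes=20)
for topic_id, words in lda.print_topics(num_topics=10, num_words=8):
    print(topic_id, words)

Each printed topic is a weighted bag of words: the textual equivalent of one of the word clouds described above.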

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works: it is a proprietary black box. While the original algorithm is available, the actual methods that go into determining the ranking of search terms are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.

A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above, we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
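A minimal version of that search-and-chart routine can be sketched in Python (the article's own implementation was in Mathematica); the frequency-file layout assumed here, one "word count" pair per line and one file per site, is an illustrative guess rather than the collection's documented format:

from pathlib import Path
import matplotlib.pyplot as plt

TERMS = ["freedom", "black", "liberty"]

def load_freqs(path):
    # Parse one "word count" pair per line (assumed layout).
    freqs = {}
    for line in path.read_text(errors="ignore").splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            freqs[parts[0].lower()] = int(parts[1])
    return freqs

sites = sorted(Path("cdc_freqs").glob("*.txt"))  # hypothetical folder
counts = {t: [] for t in TERMS}
for p in sites:
    freqs = load_freqs(p)
    for t in TERMS:
        counts[t].append(freqs.get(t, 0))

# One bar series per search term, indexed by each site's position in the
# collection, approximating the interactive chart described above.
for term in TERMS:
    plt.bar(range(len(sites)), counts[term], alpha=0.5, label=term)
plt.xlabel("site index in the CDC collection")
plt.ylabel("occurrences")
plt.legend()
plt.show()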

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.

A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are located. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly identify critical sources.
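A keyword-in-context routine is compact enough to sketch in full. The article's version was written in Mathematica; a Python equivalent might look like the following, where the window size and punctuation stripping are illustrative choices:

def kwic(text, term, window=5):
    # Print every occurrence of term with a few words of surrounding context.
    words = text.split()
    for i, w in enumerate(words):
        if w.lower().strip('.,;:"()') == term:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            print(f"{left:>40} [{w}] {right}")

with open("blackgold.txt", errors="ignore") as f:  # the file discussed above
    kwic(f.read(), "ontario")

Lining the hits up in a column like this is what lets the regional and chronological pattern of a site jump out at a glance.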

Another model through which we can visualize documents is a viewer focusing primarily on term frequency and inverse document frequency, discussed below. To prepare this material, we combine our earlier data about word frequency in a given website (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e., that the word "experience" appears 1,766 times). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, through which we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf times the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."

At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.
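Put symbolically, the weighting is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the total number of documents and df(t) the number of documents containing the term t. A minimal sketch in Python, reusing the hypothetical folder of per-site text files from the earlier examples:

import math
from collections import Counter
from pathlib import Path

def tf_idf(docs):
    # docs maps a site name to its token list; returns per-site term weights.
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))            # document frequency of each term
    weights = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)              # term frequency within this site
        weights[name] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights

docs = {p.name: p.read_text(errors="ignore").lower().split()
        for p in Path("cdc_texts").glob("*.txt")}     # hypothetical folder
weights = tf_idf(docs)
top = sorted(weights["267.txt"].items(), key=lambda kv: -kv[1])[:10]  # file name assumed
print(top)

A term that appears on every site earns a weight of zero, which is what pushes boilerplate down the list and genuinely distinctive vocabulary up.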

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule, but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance, rather than processed on-site, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need rears itself, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential of working with born-digital sources. It is intended to begin, or jumpstart, a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. These should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web: whether through blogging, micro-blogging, or publishing webpages, it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history and the readiness to work with born-digital sources cannot stem out of pedagogy alone, however. Institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article has argued, first, why digital methodologies are necessary, and second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates/works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.

14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients – 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to teach the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp, Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook, Penguin, 2010. For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online, www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can, of course, vary depending on compression methods chosen. Their math was based on an average of 8MB per book, if it was scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10000000000000000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82, published online ahead of print 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC: May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Center for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Center for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online, www.wayback.archive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online, www.web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collection Webpage, 24 November 2000, through Internet Archive, www.web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections homepage," 21 November 2003, through Internet Archive, www.web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important news about Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive, www.web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see the information at "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 pm on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and now "Emma" is the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.

48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 "Important Notices," Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. They can usually be found on index or splash pages, under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referrer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously stored cookie, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and bandwidth cap, to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated late-2007 MacBook Pro.

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.


century, such as how Canadians understood Idle No More or the Occupy movement through social media. What would her archive look like? I like to imagine it as boxes stretching out into the distance, tapering off without any immediate reference points to bring it into perspective. Many of these digital archives will be without human-generated finding aids, would have perhaps no dividers or headings within "files," and certainly no archivist with a comprehensive grasp of the collection contents. A recent example is illustrative. During the IdleNoMore protests, Twitter witnessed an astounding 55,334 tweets on 11 January 2013. Each tweet can be up to 140 characters. To put that into perspective, this article is less than 900 lines long in a word processor. Yet this massive amount of information looks cryptic: strange usernames, no archives, no folders, not even an immediate way to see whether a given tweet is relevant to your research or not.3 This is a vast collection. Information overload.

This example is indicative of the issues facing the historical profession.4 I am not exaggerating. Every day, half a billion tweets are sent,5 hundreds of thousands of comments are uploaded to news sites, and scores of blog posts are uploaded. This makes up one small aspect of broader data sets: automated logs, including climate readings, security access reports, library activity, search patterns, books and movies ordered from Amazon.ca or Netflix.6 Even if only a small fraction of this is available to future historians, it will represent an unparalleled source of information about how life was lived (or thought to have been lived) during the early twenty-first century. To provide another, perhaps more tangible example: while carrying out my previous work on youth cultures in 1960s English Canada, I was struck by how inaccessible television sources were, compared to newspaper records, despite television being arguably the primary medium of the time. Unlike the 1960s, however, the Internet leaves a plethora of available, if difficult to access, primary sources for historians. In the digital age archiving will be different, which means that the history we write will be different.

We must begin planning for this plethora of born-digital sources now. Using modern technology, we have the ability to quickly find needles in digital haystacks: to look inside a 'box' and immediately find that relevant letter, rather than spending hours searching and reading everything. Yet we also have the ability to pull our gaze back distantly, elucidating the context of our relevant documents as well.

The Necessity of Distant Reading: Historians and Computational History

Franco Moretti, in his 2005 groundbreaking work of literary criticism Graphs, Maps, Trees, called on his colleagues to rethink their craft. With the advent of mass-produced novels in nineteenth-century Britain, Moretti argued that scholars must change their disciplinary approach. Traditionally, literary scholars worked with a corpus of around 200 novels; though an impressive number for the average person, it represents only around one percent of the total output of that period. To work with all of this material in an effort to understand these novels as a genre, critics needed to approach research questions in a different way:

[C]lose reading won't help here, a novel a day every day of the year would take a century or so … And it's not even a matter of time, but of method: a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn't a sum of individual cases: it's a collective system, that should be grasped as such, as a whole.7

Instead of close reading, Moretti called for literary theorists to practice "distant reading."

For inspiration, Moretti drew on the work of French Annales historian Fernand Braudel. Braudel, amongst the most instrumental historians of the twentieth century, pioneered a similarly distant approach to history: his inquiries spanned large expanses of time and space, that of civilizations, the Mediterranean world, as well as smaller (only by comparison) regional histories of Italy and France. His distant approach to history did not necessitate disengagement with the human actors on the ground, seeing instead that beyond the narrow focus of individual events lay the constant ebb and flow of poverty and other endemic structural features.8 While his methodology does not apply itself well to rapidly changing societies (his focus was on the long-term, slow change), his points around distant reading and the need for collaborative, interdisciplinary research are a useful antecedent. Similar opportunities present themselves when dealing with the vast array of born-digital sources (as well as digitized historical sources): we can distantly read large arrays of sources, from a longue durée of time and space, mediated through algorithms and computer interfaces, opening up the ability to consider the voices of everybody preserved in the source.

Historians must be open to the digital turn due to the astounding growth of digital sources and an increasing technical ability to process them on a mass scale. Social sciences and humanities researchers have begun to witness a profound transformation in how they conduct and disseminate their work. Indeed, historical digital sources have reached a scale where they defy conventional analysis and now require computational analysis. Responding to this, archives are increasingly committed to preserving cultural heritage materials in digital forms.

I would like to pre-emptively respond to two criticisms of this digital shift within the historical profession. First, historians will not all have to become programmers. Just as not all historians need a firm grasp of Geographical Information Systems (GIS), a developed understanding of the methodological implications of community-based oral history, the incorporation of a transnational perspective, or in-depth engagement with cutting-edge demographic models, not all historians have to approach their trade from a computational perspective. Nor should they. Digital history does not replace close reading, traditional archival inquiry, or going into communities, to use only a few examples, to uncover notions of collective memory or trauma. Indeed, digital historians will play a facilitative role and provide a broader reading context. Yet they will still be historians: collecting relevant sources, analyzing and contextualizing them, situating them in convincing narratives or explanatory frameworks, and disseminating their findings to wider audiences.

Neither will historians replace professional programmers. Instead, as with other subfields, historians using digital sources will need to be prepared to work on larger, truly interdisciplinary teams. Computer scientists, statisticians, and mathematicians, for example, have critical methodological and analytical contributions to make towards questions of historical inquiry. Historians need to forge further links with these disciplines. Such meaningful collaboration, however, will require historians who can speak their languages and who have an understanding of their work. Some of us must be able to discuss computational or algorithmic methodologies when we collectively explore the past.

The methodologies discussed in this paper fall into the twin fields of data mining and textual analysis. The former refers to the discovery of patterns or information about large amounts of information, while the latter refers to the use of digital tools to analyze large quantities of text.9 The case study here draws upon three main concepts: information retrieval (quickly finding what you want), information and term frequency (how often various concepts and terms appear), and topic modeling (finding frequently occurring bags of words throughout documents). Other projects have involved authorship attribution, which is determining the author of documents based on previous patterns, style analysis, or creating network graphs of topics, groups, documents, or individuals.

The Culturomics project, run by the Cultural Observatory at Harvard University, is a notable model of collaboration on a big history project. As laid out in Science and put online as the Google Books n-gram viewer (www.books.google.com/ngrams), this project indexed word and phrase frequency across over five million books, enabling researchers to trace the rise and fall of cultural ideas and phenomena through targeted keyword and phrase searches.10 This is the most ambitious, and certainly the most advanced and accessible, Big History project yet launched. Yet amongst the extensive author list (13 individuals and the Google Books team) there were no historians. This suggests that other disciplines in the humanities and social sciences have embraced the digital turn far more than historians. However, as I mentioned, in the coming decades historians will increasingly rely on digital archives. There is a risk that history as a professional discipline could be left behind.

American Historical Association President Anthony Grafton noted this absence in a widely circulated Perspectives article.11 Why was there no historian, he asked, when this was both primarily a historical project and a team comprised of varied doctoral holders from disciplines such as English literature, psychology, computer science, biology, and mathematics? Culturomics project leaders Erez Lieberman Aiden and Jean-Baptiste Michel responded frankly: while the team approached historians, and some did play advisory roles, every author listed "directly contributed to either the creation of the collection of written texts (the 'corpus') or to the design and execution of the specific analyses we performed. No academic historians met this bar." They continued to indict the profession:

The historians who came to the meeting were intelligent, kind, and encouraging. But they didn't seem to have a good sense of how to wield quantitative data to answer questions, didn't have relevant computational skills, and didn't seem to have the time to dedicate to a big multi-author collaboration. It's not their fault: these things don't appear to be taught or encouraged in history departments right now.12

This is a serious indictment considering the pending archival shift to digital sources, and one that still largely holds true (a few history departments now offer digital history courses, but they are still outliers). This form of history, the massive analysis of digitized information, will continue. Furthermore, these approaches will attract public attention, as they lend themselves well to accessible online deployment. Unless historians are well positioned, they risk ceding professional ground to other disciplines.

While historians were not involved in Culturomics, they are, however, involved in several other digital projects. Two major and ambitious long-term Canadian projects are currently using computational methods to reconstruct the demographic past. The first is the Canadian Century Research Infrastructure project, which is creating databases from five censuses and aiming to provide "a new foundation for the study of social, economic, cultural, and political change," while the second is the Programme de recherche en démographie historique, hosted by the Université de Montréal.13 They are reconstructing the European population of Quebec in the seventeenth and eighteenth centuries, drawing heavily on parish registers. Furthermore, several projects in England have harnessed the wave of interest in "big data" to tell compelling historical stories about their pasts: the Criminal Intent project, for example, used the proceedings of the Old Bailey courts (the main criminal court in Greater London) to visualize large quantities of historical information.14 The Criminal Intent project was itself funded by the Digging into Data Challenge, which brings together multinational research teams and funding from federal-level funding agencies in Canada, the United States, the United Kingdom, and beyond. Several other historical studies have been funded under their auspices, from studies of commodity trade in the eighteenth- and nineteenth-century Atlantic world (involving scholars at the University of Edinburgh, the University of St Andrews, and York University) to data mining the degrees of economic opportunity and spatial mobility in Britain, Canada, and the United States (bringing researchers at the University of Guelph, the University of Leicester, and the University of Minnesota together).15

These exciting interdisciplinary teams may be heralding a disciplinary shift, although it is still early days.

Collaborative teams have limitations, though. We will have to be prepared to tackle digital sources alone, on shoestring budgets. Historians will not always be able to call on external programmers or specialists to do work, because teams often develop out of funding grants. As governmental and university budgets often prove unable to meet consistent research needs, baseline technical knowledge is required for independent and unfunded work. If historians rely exclusively on outside help to deal with born-digital sources, the ability to do computational history diminishes. As the success of the Programming Historian indicates, and as I will demonstrate in my case study below, historians can learn basic techniques.16 Let us tweak our own algorithms.

A social or cultural history of the very late-twentieth or twenty-first century will have to account for born-digital sources, and historians will (in conjunction with archival professionals) have to curate and make sense of the data. Historians need to be able to engage with these sources themselves, or at least as part of a team, rather than having sources mediated through the lens of outside disciplines or commercial interests. Algorithms may be one meaningful way to make sense of cultural trends or extract meaningful information from billions of tweets, but we must ensure professional historical input goes into their crafting.

The Third Wave of Computational History

Historians have fruitfully used computers in the past, as part of two previous waves. The 1960s and 1970s saw computational tools used for demographic, population, and economic histories, a trend that in Canada saw full realization in comprehensive studies such as Michael Katz's The People of Hamilton, Canada West.17 As Ian Anderson has convincingly noted in a review of history and computing, studies of this sort saw the field become associated with quantitative studies.18

After a retreat from the spotlight, computational history rose again in the 1990s with the advent of the personal computer, graphical user interfaces, and overall improvements in ease of use. Punch cards could now give way to database programs, GIS, and early online networks such as H-Net and Usenet. We are now, I believe, on the cusp of a third revolution in computational history, thanks to three main factors: decreasing storage costs, the power of the Internet and distributed cloud computing, and the rise of professionals dealing with both digital preservation and open-source tools.

Decreasing storage costs have led to massive amounts of information being preserved, as Gleick noted in this article's epigraph. It is important to note that this information overload is not new: people have long worried about the impact of too much information.19 In the sixteenth century, responding to the rise of the printing press, the German priest Martin Luther decried that the "multitude of books [were] a great evil"; in the nineteenth century, Edgar Allan Poe bemoaned "[t]he enormous multiplication of books in every branch of knowledge is one of the greatest evils of this age." As recently as 1970, American historian Lewis Mumford lamented that "the overproduction of books will bring about a state of intellectual enervation and depletion hardly to be distinguished from massive ignorance."20 The rise of born-digital sources must thus be seen in this continuous context of handwringing around the expansion and rise of information.

Useful historical information is being preserved at a rate that is accelerating with every passing day. One useful metric to compare dataset size is that of the American Library of Congress (LOC); indeed, a "LOC" has become shorthand for a unit of measurement in the digital humanities. Measuring the actual extent of the printed collection in terms of data is notoriously difficult: James Gleick has claimed it to be around 10TB,22 whereas the Library of Congress itself claims a more accurate figure is about 200TB.23 While these two figures are obviously divergent, neither is on an unimaginable scale; personal computers now often ship with a terabyte or more of storage. When Claude Shannon, the father of information theory, carried out a thought experiment in 1949 about items or places that might store "information" (an emerging field, which viewed messages in their aggregate), he scaled sites from a digit upwards to the largest collection he could imagine. The list scaled up logarithmically from 10^0 to 10^13. If a single-spaced page of typing is 10^4, and 10^7 the Proceedings of the Institute of Radio Engineers, and 10^9 the entire text of the Encyclopaedia Britannica, then 10^14 was "the largest information stockpile he could think of: the Library of Congress."24

But now the Library of Congress' print collection, something that has taken two centuries to gather, is dwarfed by its newest collection of archived born-digital sources, the vast majority only a matter of years old, compared to the much wider date range of non-digital sources in the traditional collection. The LOC has begun collecting its own born-digital Internet archive: multiple snapshots are taken of webpages to show how they change over time. Even as a selective, curated collection, drawing on governmental, political, educational, and creative websites, the LOC has already collected 254TB of data, and adds 5TB a month.25 The Internet Archive, through its Wayback Machine, is even more ambitious: it seeks to archive every website on the Internet. While its size is also hard to put an exact finger on, as of late 2012 it had over 10 petabytes of information (or, to put it into perspective, a little over 50 Libraries of Congress, if we take the 200TB figure).26

This is to some degree comparing apples and oranges: the Library of Congress is predominantly print, whereas the Internet Archive has considerable multimedia holdings, including videos, images, and audio files. The LOC is furthermore a more curated collection, whereas the Internet Archive draws from a wider range of producers. The Internet offers the advantages and disadvantages of being a more democratic archive. For example, a GeoCities site (an early way for anybody to create their own website) created by a 12-year-old Canadian teenager in the mid-1990s might be preserved by the Internet Archive, whereas it almost certainly would not be saved by a national archive. This difference in size and collections management, however, is at the root of the changing historians' toolkit. If we make use of these large data troves, we can access a new and broad range of historical subjects.

Conventional sources are also part of this deluge. Google Books has been digitizing the printed word, and it currently has 15 million works completed; by 2020, it audaciously aims to digitize every book ever written. While copyright issues loom, the project will allow researchers to access aggregate word and phrase data, as seen in the Google n-gram project.27 In Canada, Library and Archives Canada (LAC) has a more circumscribed digitization project, collecting "a representative sample of Canadian websites," focusing particularly on Government of Canada webpages, as well as a curated approach preserving case studies such as the 2006 federal election. Currently it has 4TB of federal government information and 7TB of total information in its larger web archive (a figure which, as I note below, is unlikely to drastically improve, despite ostensible goals of archival 'modernization').28

In Canada, LAC has, on the surface, been advancing an ambitious digitization program, as it realizes the challenges now facing archivists and historians. Daniel J. Caron, the Librarian and Archivist of Canada, has been outspoken on this front, with several public addresses on the question of digital sources and archives. Throughout 2012, primary source digitization at LAC preoccupied critics, who saw it as a means to depreciate on-site access. The Canadian Association of University Teachers and the Canadian Historical Association, as well as other professional organizations, mounted campaigns against this shift in resources.29 LAC does have a point, however, with its modernization agenda: the state of Canada's online collections is small and sorely lacking when compared to its on-site collections, and LAC does need to modernize and achieve the laudable goal of reaching audiences beyond Ottawa.

Nonetheless, digitization has unfortunately been used as a smokescreen to depreciate overall service offerings.30 The modernization agenda would also see 50 percent of the digitization staff cut.31 LAC's online collections are small; they do not work with online developers through Application Programming Interfaces (APIs, a way for a computer program to talk to another computer and speed up research; Canadiana.ca is incidentally a leader in this area), and there have been no promising indications that this state of affairs will change. Digitization has not been comprehensive. Online preservation is important, and historians must fight for both on-site and on-line access.

A number of historians have recognized this challenge and are playing an instrumental role in preserving the born-digital past. The Roy Rosenzweig Center for History and New Media at George Mason University launched several high-profile preservation projects: the September 11th Digital Archive, the Hurricane Digital Memory Bank (focusing on Hurricanes Katrina and Rita), and, as of writing, the Occupy archive.32 Massive amounts of online content are curated and preserved: photographs, news reports, blog posts, and now tweets. These complement more traditional efforts of collecting and preserving oral histories and personal recollections, which are then geo-tagged (tied to a particular geographical location), transcribed, and placed online. These archives serve a dual purpose: preserving the past for future generations, while also facilitating easy dissemination for today's audiences.

Digital preservation, however, is a topic of critical importance. Much of our early digital heritage has been lost. Many websites, especially those predating the Internet Archive's 1996 web archiving project launch, are completely gone. Early storage mediums (from tape to early floppy disks) have become obsolete and are now nearly inaccessible. Compounding this, software development has seen early proprietary file formats depreciated, ignored, and eventually made wholly inaccessible.33 Fortunately, the problem of digital preservation was recognized by the 1990s, and in 2000 the American government established the National Digital Information Infrastructure and Preservation Program.34 As early as 1998, digital preservationists argued, "[h]istorians will look back on this era … and see a period of very little information. A 'digital gap' will span from the beginning of the wide-spread use of the computer until the time we eventually solve this problem. What we're all trying to do is to shorten that gap." To bear home their successes, that quotation is taken from an Internet Archive snapshot of Wired magazine's website.35

Indeed, greater attention towards rigorous metadata (which is, briefly put, data about data, often preserved in plain text format), open-access file formats documented by the International Organization for Standardization (ISO), and dedicated digital preservation librarians and archivists gives us hope for the future.36

It is my belief that cloud storage, the online hosting of data in large third-party offsite centres, mitigates the issue of obsolete storage media. Furthermore, the move towards open-source file formats, even amongst large commercial enterprises such as Microsoft, provides greater documentation and uniformity.

A Brief Case Study: Accessing and Navigating Canada’s Digital Collections, with an Emphasis on Tools for Born-Digital Sources

What can we do about this digital deluge? There are no simple answers, but historians must begin to conceptualize new additions to their traditional research and pedagogical toolkits. In the section that follows, I will do two things: introduce a case study, and introduce several tools, both proprietary and openly available off-the-shelf, to reveal one way forward.

For a case study, I have elected to use Canada’s Digital Collections (CDC), a collection of ‘dead’ websites removed from their original location on the World Wide Web and now digitally archived at LAC. In this section, I will introduce the collection using digitally-archived sources from the early World Wide Web and discuss my research methodology, before illustrating how we can derive meaningful data from massive arrays of unstructured data.37

This is a case study of a web archive. Several of these methodologies could be applied to other formats, including large arrays of textual material, whether born-digital or not. A distant reading of this type is not medium specific. It does, however, assume that the text is high quality. Digitized primary documents do not always have text layers, and if they are instituted computationally through Optical Character Recognition, errors can appear. Accuracy rates for digitized textual documents can range from as low as 40 percent for seventeenth- and eighteenth-century documents, to 80–90 percent for modern microfilmed newspapers, to over 99 percent for typeset, word-processed documents.38 That said, while historians today may be more interested in applying distant reading methodologies to more traditional primary source materials, it is my belief that in the not-so-distant future websites will be an increasingly significant source for social, cultural, and political historians.

What are Canada’s Digital Collections? In 1996, Industry Canada, funded by the federal Youth Employment Strategy, facilitated the development of Canadian heritage websites by youth (defined as anyone between the ages of 15 and 30). It issued contracts to private corporations that agreed to use these individuals as web developers, for the twofold purpose of developing digital skills and instilling respect for, and engagement with, Canadian heritage and preservation.39 The collection, initially called SchoolNet but rebranded as Canada’s Digital Collections in 1999, was accessible from the CDC homepage. After 2004, the program was shut down and the vast majority of the websites were archived on LAC’s webpage. This collection is now the largest born-digital Internet archive that LAC holds, worthy of study as content as well as for thinking about methodological approaches.

Conventional records about the program are unavailable in the traditional LAC collection, unsurprising in light of both its recent creation and backlogs in processing and accessioning archival materials. We can use one element of our born-digital toolkit, the Internet Archive’s WaybackMachine (its archive of preserved early webpages), to get a sense of what this project looked like in its heyday. The main webpage for the CDC was www.collections.ic.gc.ca. However, a visitor today will see nothing but a collection of broken image links (two of them will bring you to the English and French-language versions of the LAC-hosted web archive, although this is only discernible through trial and error or through the site’s source code).

With the WaybackMachine, we have a powerful, albeit limited, opportunity to “go back in time” and experience the website as previous web viewers would have (if a preserved website is available, of course). Among the over 420 million archived websites, we can search for and find the original CDC site. Between 12 October 1999 and the most recent snapshot, or web crawl (of the now-broken site), on 6 July 2011, the site was crawled 410 times.40 Unfortunately, every site version until the 15 October 2000 crawl has been somehow corrupted. Content has instead been replaced by Chinese characters, which translate only into place names and other random words. Issues of early digital preservation, noted above, have reared their head in this collection.

From late 2000 onward, however, fully preserved CDC sites are accessible. Featured sites, new entries, success stories, awards, curricular units, and prominent invitations to apply for project funding populate the brightly coloured website. The excitement is palpable. John Manley, then Industry Minister, declared that “it is crucial that Canada’s young people get the skills and experience they need to participate in the knowledge-based economy of the future,” and that the CDC “shows that young people facing significant employment challenges can contribute to and learn from the resources available on the Information Highway.”41 The documentation, reflecting the lower level of technical expertise and scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as well as preservation issues and hardware requirements for computers (a computer with an Intel 386 processor, 8MB of RAM, a 1GB hard drive, and a 28.8bps modem was recommended), along with training information for employers, educators, and participants.42 Sample proposals and community liaison ideas, amidst other information, set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites offers an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how viewers discovered the site and whether they think it is a good educational resource (they must have been relieved that the majority thought so; in 2003, some 84 percent felt that it was a “good” resource, the best of its type).43 By 2004, the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, alphabetical listing, success stories, and detailed copyright information.

The Canada’s Digital Collections site began to move towards removal in November 2006, and was initially completely gone. A visitor to the site would learn that the content “is no longer available,” and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007, a notice appeared with a declaration that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained then, and today for that matter, is a straightforward alphabetical list and a disclaimer noting: “[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost.”

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use to navigate large amounts of data. First, I create my own tools through computer programming; this work is showcased in the following section. I program in Wolfram Research’s Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base that complements exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website in perfectly readable plaintext (avoiding unicode errors), as follows: input = Import["http://activehistory.ca/", "Plaintext"], and a second can make it all lower case for textual analysis, as such: lowerpage = ToLowerCase[input]. Variables are a critical building block: in the former example, “input” now holds the entire plaintext of the website ActiveHistory.ca, and “lowerpage” then holds it in lower case. Once we have lower-cased information, we are quickly able to extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
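To give a concrete sense of how compact this work can be, the following is a minimal sketch of the word-frequency step (an illustrative reconstruction rather than the exact routine used here; the URL is the same example as above):

    input = Import["http://activehistory.ca/", "Plaintext"];
    lowerpage = ToLowerCase[input];
    words = DeleteCases[StringSplit[lowerpage, Except[WordCharacter] ..], ""];  (* tokenize on runs of non-word characters *)
    freqs = Reverse[SortBy[Tally[words], Last]];  (* word counts, most frequent first *)
    Take[freqs, 25]  (* the twenty-five most frequent words *)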

Even more promising is Mathematica’s integration with the Wolfram|Alpha database (itself accessible at wolframalpha.com), a computational knowledge engine that promises to put incredible amounts of information at a user’s fingertips. With a quick command, one can access the temperature at Toronto’s Lester B. Pearson International Airport at 5:00 pm on 18 December 1983, the relative popularity of names such as “Edith” or “Emma,” the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information or find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.
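Such queries can be issued directly from a Mathematica notebook; a brief sketch (the query strings are illustrative, and a live connection to Wolfram|Alpha is assumed):

    WolframAlpha["Canada unemployment rate March 1985"]
    WolframAlpha["Canada GDP December 1995"]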

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be-all and end-all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools.47 It is a powerful suite of textual analysis tools. Uploading one’s own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a “word cloud” of a corpus; the relative rise and fall of any word over time (word trends, accessed by clicking on a word); the original text; and, by clicking on any word, its contextual placement within individual sentences. It ably facilitates the distant reading of any reasonably-sized body of textual information and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. There are many visualization options provided: you can plot relationships between data, compare values (i.e., bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or, in a best-case scenario, hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into “topics” that appear together, giving a quick sense of a document’s structure.49 Other workflows found within SEASR include Dunning’s log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find “people,” “places,” and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2 or on my own blog at ianmilligan.ca. I have elected to leave them online rather than provide commands in this article: they require some technical knowledge, and working directly from the web is quicker than doing so here. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today’s historian, dealing with conventional sources, often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada’s Digital Collections, a single command-line entry brought the entire archive onto my system; it took days to process the information, and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay, or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations. The information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB; 360 MB of plaintext with HTML encoding, reduced to a ‘mere’ 50MB if all encoding is removed. A 50MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with the material. Before you do this, though, you need to be familiar with the website’s “terms of service,” the legal documentation that outlines how users are allowed to use a website, its resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information “may be reproduced in part or in whole and by any means, without charge or further permission, unless otherwise specified.”51

The CDC web archive index is organized like so: the index is located at www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in …/ic/cdc/ABbenevolat, the next in …/ic/cdc/abnature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system; in this case, one noting that you had seen the disclaimer about the archival nature of the website) and then, with a one-line command in my command line, to download the entire database.52 Due to the imposition of a random wait and bandwidth limit, which meant that I would download one file every half to one and a half seconds, at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.
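For readers who want a sense of what such an invocation looks like, the following sketch is an assumption pieced together from the constraints described above (a random wait of half to one and a half seconds, a 30 kilobyte-per-second cap, and a previously saved cookie file), not the precise command used:

    # recursive, polite crawl of the archived CDC index; all flags are standard wget options
    wget -r --no-parent --wait=1 --random-wait --limit-rate=30k \
         --load-cookies cookies.txt http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/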

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these connections must often be inferred; online, HTML born-digital sources are different in that the link offers a very concrete file.

Visualizing the CDC Collection as a Series of Hyperlinks

One quick visualization, the circular cluster seen above, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As the figure shows, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The site is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and relationships between different sources. At a glimpse, we can see suggestions to follow other sites and determine if any nodes are shared amongst the various websites revealed.
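A minimal sketch of how such a link graph can be assembled in Mathematica, assuming a list of page URLs is already in hand ("urls" below is a hypothetical stand-in for that list):

    urls = {"http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp"};  (* stand-in list *)
    edges = Flatten[Table[DirectedEdge[url, target],
       {url, urls}, {target, Import[url, "Hyperlinks"]}]];  (* scrape each page's outbound links *)
    Graph[DeleteDuplicates[edges], VertexLabels -> Placed["Name", Tooltip]]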

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada’s Digital Collections corpus? As a historian primarily interested in the written word, my first step was to isolate the plaintext throughout the material so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels: first, the aggregate collection; and second, dividing the collection into each ‘file,’ or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words within the nearly eight million words of the collection appear? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word “Canada” appears 26,429 times, “collections” 11,436, and “[L]ouisbourg” an astounding 9,365 times. Reading such a list can quickly wear out a reader and is of minimal visual attractiveness. Two quick decisions were made to enhance data usability. First, the information was “case-folded,” all sent to lower case; this enables us to count the total appearances of a word such as “digital” regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency (as opposed to phrase frequency), “stop words” should be removed, as they were here. These are “extremely common words which would appear to be of little value in helping select documents matching a user need”: words such as “the,” “and,” “to,” and so forth. They matter for phrases (the example “to be or not to be” is comprised entirely of stop words) but will quickly dominate any word-level visualization or list.53 Other words may need to be added to or removed from stop lists depending on the corpus; for example, if the phrase “page number” appeared on every page, that would be useless information for the researcher.
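Building on the frequency list sketched earlier, stop-word removal is a one-line filter; the stop-word list shown here is a tiny illustrative stub rather than the full list used in practice:

    stopwords = {"the", "and", "to", "of", "a", "in", "is"};  (* illustrative stub *)
    filtered = Select[freqs, ! MemberQ[stopwords, First[#]] &];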

The first step, then, is to use Mathematica to format word frequency information into a Wordle.net-readable format (if we put all of the text into that website, it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds.

A Word Cloud of All Words in the CDC Corpus

What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that the collection deals with Canada and Canadians, that it holds documents in both French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that “ontario” is less frequent than “newfoundland,” “nova scotia,” and “alberta,” although “québec” and “quebec” are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, they leave a severe amount of ambiguity: are “ontario” and “québec” equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word “school” refer to built structures or educational experiences? The same question applies to “church.” Do mentions of “newfoundland” refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but need to be used with extreme caution. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website “past and present.” Despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.

Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters, such as ‘bi’; two syllables; or, pertinent in our case, two words (an example: “canada is”). A trigram is three words, a quadgram four words, a fivegram five words, and so on.
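Generating word-level n-grams from a tokenized text is equally compact; a sketch, reusing the "words" list from the earlier example:

    bigrams = Partition[words, 2, 1];  (* overlapping word pairs *)
    trigrams = Partition[words, 3, 1];  (* overlapping word triples *)
    topTrigrams = Reverse[SortBy[Tally[trigrams], Last]];  (* most frequent first *)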

Word Cloud of the File “Past and Present”

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach’s utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named “past and present” website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

{"alberta", 758}, {"listen", 490}, {"settlers", 467}, {"new", 454}, {"canada", 388}, {"land", 362}, {"read", 328}, {"people", 295}, {"settlement", 276}, {"chinese", 268}, {"canadian", 264}, {"place", 255}, {"heritage", 254}, {"edmonton", 250}, {"years", 238}, {"west", 236}, {"raymond", 218}, {"ukrainian", 216}, {"came", 213}, {"calgary", 209}, {"names", 206}, {"ranch", 204}, {"updated", 193}, {"history", 190}, {"communities", 183}, …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example in this automatically generated output of top phrases:

{{"first", "people", "settlers"}, 100}, {{"the", "united", "states"}, 72}, {{"one", "of", "the"}, 59}, {{"of", "the", "west"}, 58}, {{"settlers", "last", "updated"}, 56}, {{"people", "settlers", "last"}, 56}, {{"albertans", "first", "people"}, 56}, {{"adventurous", "albertans", "first"}, 56}, {{"the", "bar", "u"}, 55}, {{"opening", "of", "the"}, 54}, {{"communities", "adventurous", "albertans"}, 54}, {{"new", "communities", "adventurous"}, 54}, {{"bar", "u", "ranch"}, 53}, {{"listen", "to", "the"}, 52}, …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to quickly hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if “aboriginal” appears 457 times, then n=457); or second, relative appearances of a given term (if “aboriginal” appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the ‘chorus’ effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common “stop-word” equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

{{"experience", "in", "the"}, 919}, {{"transferred", "to", "the"}, 918}, {{"funded", "by", "the"}, 918}, {{"of", "the", "provincial"}, 915}, {{"this", "web", "site"}, 915}, {{"409", "registry", "2"}, 914}, {{"are", "no", "longer"}, 914}, {{"gouvernement", "du", "canada"}, 910}, {{"use", "this", "link"}, 887}, {{"page", "or", "you"}, 887}, {{"following", "page", "or"}, 887}, {{"automatically", "transferred", "to"}, 887}, {{"be", "automatically", "transferred"}, 887}, {{"will", "be", "automatically"}, 887}, {{"industry", "you", "will"}, 887}, {{"multimedia", "industry", "you"}, 887}, {{"practical", "experience", "in"}, 887}, {{"gained", "practical", "experience"}, 887}, {{"they", "gained", "practical"}, 887}, {{"while", "they", "gained"}, 887}, {{"canadians", "while", "they"}, 887}, {{"young", "canadians", "while"}, 887}, {{"showcase", "of", "work"}, 887}, {{"excellent", "showcase", "of"}, 887}, …

Here we see an array of common language: funding information, link information, and project information spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: “the history of,” for example, “heritage community foundation,” libraries, and so forth. By the end of the list, we can move into the unique or very rare phrases that appear only once, twice, or three times.

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton’s David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time. … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 (“past and present”), we can begin to get a sense of the topics this large website discusses.
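For those working with MALLET directly rather than through Meandre, the workflow is roughly a two-step import-then-train sequence; the directory names and topic count below are illustrative assumptions:

    bin/mallet import-dir --input cdc-plaintext/ --output cdc.mallet \
        --keep-sequence --remove-stopwords
    bin/mallet train-topics --input cdc.mallet --num-topics 20 \
        --output-topic-keys topic-keys.txt --output-doc-topics doc-topics.txt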

At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta towns, bridges, lakes); the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language); the stories (story, knight, archives, but also, interestingly, knight and wife); and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text down to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works. It is a proprietary black box: while the original algorithm is available, the actual methods that go into determining the ranking of search terms are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.

A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words “freedom,” “black,” and “liberty” (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above, we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
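The underlying idea can be compressed into a few lines; a sketch, assuming the per-site plaintext files have already been read into a list named "texts" (a hypothetical variable):

    terms = {"freedom", "black", "liberty"};
    counts = Table[Tooltip[StringCount[ToLowerCase[text], term], term],
       {text, texts}, {term, terms}];  (* substring counts; hover over a bar to see its term *)
    BarChart[counts]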

We can then move down into the data. In the example above, “047,” labeled “blackgold.txt,” contains 31 occurrences of the word “ontario.” Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.
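The routine itself is not reproduced here, but a keyword-in-context lookup can be sketched with a regular expression; the five-word window on either side is an arbitrary illustrative choice:

    kwic = StringCases[ToLowerCase[text],
       RegularExpression["(\\S+\\s+){0,5}ontario(\\s+\\S+){0,5}"]];
    Column[kwic]  (* each hit with up to five words of context per side *)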

A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly locate critical sources.

Another model through which we can visualize documents is a viewer primarily focusing on term document frequency and inverse document frequency, terms discussed below. To prepare this material, we combine our earlier data about word frequency (that the word “experience” appears 919 times in a given site, perhaps) with aggregate information about frequency in the entire corpus (i.e., the word “experience” appears 1,766 times overall). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf multiplied by the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select “inverse document frequency.”
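In code this is a one-liner; a sketch, with hypothetical document counts, since the per-site document frequency of “experience” is not given above:

    tfIdf[tf_, nDocs_, nDocsWithTerm_] := tf*Log[nDocs/nDocsWithTerm]
    tfIdf[919, 384., 100.]  (* hypothetical: "experience" in 100 of the 384 sites *)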

At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance rather than processed on demand, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need rears itself, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential of working with born-digital sources. It is intended to begin, or jumpstart, a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated, and relevant material isolated, for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. They should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and of attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software and databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital repositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information; this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web, whether through blogging, micro-blogging, publishing webpages, and so forth; it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history and the readiness to work with born-digital sources cannot stem from pedagogy alone, however. Institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a “digital humanities” option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary and, second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu’un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l’histoire des travailleurs.

Endnotes

1 The question of when the past becomes a “history” to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996: Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See “IdleNoMore Timeline,” ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on “Collecting History Online.” Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, “History and the Second Decade of the Web,” Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, “Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day,” ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user’s following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, “This is What a Tweet Looks Like,” ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, “How Computation Changes Research,” in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, “History and the Social Sciences: The Longue Durée,” in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at “Text Analysis,” Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science 331 (14 January 2011): 176–82. The tool is publicly accessible at “Google Books Ngram Viewer,” Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, “Loneliness and Freedom,” Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, “Loneliness and Freedom,” Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.

14 “With Criminal Intent,” CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see “Award Recipients – 2011,” Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to learn the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, “Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective,” Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, “History and Computing,” Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, “Text and Document Visualization,” The Humanities and Technology Camp Greater Toronto Area, Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, ON, 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook (Penguin, 2010). For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2, The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, “Cisco CRS-3: Download Library of Congress in Just Over One Second,” Mashable (9 March 2010), accessed online: www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a “byte” (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can, of course, vary depending on the compression methods chosen. Their math was based on an average of 8MB per book, if it was scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, “Transferring ‘Libraries of Congress’ of Data,” Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 “About the Library of Congress Web Archives,” Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 “10,000,000,000,000,000 Bytes Archived,” Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82, published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, “Google’s Piracy Liability,” Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 “Government of Canada Web Archive,” Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, “Visiting Library and Archives in Ottawa? Not Without an Appointment,” Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, “LAC Begins Implementation of New Approach to Service Delivery,” Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the “CHA’s Comment on New Directions for Library and Archives Canada,” CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and “Save Library & Archives Canada,” Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, “The Smokescreen of ‘Modernization’ at Library and Archives Canada,” ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 “Save LAC May 2012 Campaign Update,” SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, “The Future of Preserving the Past,” CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see “September 11 Digital Archive: Saving the Histories of September 11, 2001,” Center for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; “Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita,” Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and “Occupy Archive: Archiving the Occupy Movements from 2011,” Center for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on “Preserving Digital History.” The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see “Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress,” Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, “No Way to Run a Culture,” Wired, 13 February 1998, available online via the Internet Archive, www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, “WARC Files: A Challenge for Historians, and Finding Needles in Haystacks,” ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians, but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, “Digitization: A Cautionary Tale,” New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, “Ocropodium: Open Source OCR for Small-Scale Historical Archives,” Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, “Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac: The Times Digital Archive: The Case of Two Million versus Two Millions,” Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online: wayback.archive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online: web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collections Webpage, 24 November 2000, through Internet Archive: web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections homepage," 21 November 2003, through Internet Archive: web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important news about Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive: web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 pm on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and "Emma" is now the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.


48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 "Important Notices," Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. They can usually be found on index or splash pages, under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated, late-2007 MacBook Pro.

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>; and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.



find that relevant letter, rather than spending hours searching and reading everything. Yet we also have the ability to pull our gaze back, distantly elucidating the context of our relevant documents as well.

The Necessity of Distant Reading: Historians and Computational History

Franco Moretti, in his groundbreaking 2005 work of literary criticism, Graphs, Maps, Trees, called on his colleagues to rethink their craft. With the advent of mass-produced novels in nineteenth-century Britain, Moretti argued, scholars must change their disciplinary approach. Traditionally, literary scholars worked with a corpus of around 200 novels; though an impressive number for the average person, it represents only around one percent of the total output of that period. To work with all of this material in an effort to understand these novels as a genre, critics needed to approach research questions in a different way:

[C]lose reading won't help here, a novel a day every day of the year would take a century or so … And it's not even a matter of time, but of method: a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn't a sum of individual cases: it's a collective system, that should be grasped as such, as a whole.7

Instead of close reading, Moretti called for literary theorists to practice "distant reading."

For inspiration, Moretti drew on the work of French Annales historian Fernand Braudel. Braudel, amongst the most instrumental historians of the twentieth century, pioneered a similarly distant approach to history: his inquiries spanned large expanses of time and space, that of civilizations and the Mediterranean world, as well as smaller (only by comparison) regional histories of Italy and France. His distant approach to history did not necessitate disengagement with the human actors on the ground, seeing instead that beyond the narrow focus of individual events lay the constant ebb and flow of poverty and other endemic structural features.8 While his methodology does not apply itself well to rapidly changing societies (his focus was on long-term, slow change), his points around distant reading and the need for collaborative, interdisciplinary research are a useful antecedent. Similar opportunities present themselves when dealing with the vast array of born-digital sources (as well as digitized historical sources): we can distantly read large arrays of sources, from a longue durée of time and space, mediated through algorithms and computer interfaces, opening up the ability to consider the voices of everybody preserved in the source.

Historians must be open to the digital turn, due to the astounding growth of digital sources and an increasing technical ability to process them on a mass scale. Social sciences and humanities researchers have begun to witness a profound transformation in how they conduct and disseminate their work. Indeed, historical digital sources have reached a scale where they defy conventional analysis and now require computational analysis. Responding to this, archives are increasingly committed to preserving cultural heritage materials in digital forms.

I would like to pre-emptively respond to two criticisms of this digital shift within the historical profession. First, historians will not all have to become programmers. Just as not all historians need a firm grasp of Geographical Information Systems (GIS), a developed understanding of the methodological implications of community-based oral history, the incorporation of a transnational perspective, or in-depth engagement with cutting-edge demographic models, not all historians have to approach their trade from a computational perspective. Nor should they. Digital history does not replace close reading, traditional archival inquiry, or going into communities (to use only a few examples) to uncover notions of collective memory or trauma. Indeed, digital historians will play a facilitative role and provide a broader reading context. Yet they will still be historians: collecting relevant sources, analyzing and contextualizing them, situating them in convincing narratives or explanatory frameworks, and disseminating their findings to wider audiences.

Neither will historians replace professional programmers. Instead, as with other subfields, historians using digital sources will need to be prepared to work on larger, truly interdisciplinary teams. Computer scientists, statisticians, and mathematicians, for example, have critical methodological and analytical contributions to make towards questions of historical inquiry. Historians need to forge further links with these disciplines. Such meaningful collaboration, however, will require historians who can speak their languages and who have an understanding of their work. Some of us must be able to discuss computational or algorithmic methodologies when we collectively explore the past.

The methodologies discussed in this paper fall into the twin fields of data mining and textual analysis. The former refers to the discovery of patterns within large amounts of information, while the latter refers to the use of digital tools to analyze large quantities of text.9 The case study here draws upon three main concepts: information retrieval (quickly finding what you want), information and term frequency (how often various concepts and terms appear), and topic modeling (finding frequently occurring bags of words throughout documents). Other projects have involved authorship attribution (determining the author of a document based on previous patterns), style analysis, or creating network graphs of topics, groups, documents, or individuals.

The Culturomics project, run by the Cultural Observatory at Harvard University, is a notable model of collaboration on a big history project. As laid out in Science and put online as the Google Books n-gram viewer (books.google.com/ngrams), this project indexed word and phrase frequency across over five million books, enabling researchers to trace the rise and fall of cultural ideas and phenomena through targeted keyword and phrase searches.10 This is the most ambitious, and certainly the most advanced and accessible, Big History project yet launched. Yet amongst the extensive author list (13 individuals and the Google Books team), there were no historians. This suggests that other disciplines in the humanities and social sciences have embraced the digital turn far more than historians. However, as I mentioned, in the coming decades historians will increasingly rely on digital archives. There is a risk that history, as a professional discipline, could be left behind.

American Historical Association President Anthony Grafton noted this absence in a widely circulated Perspectives article. Why was there no historian, he asked, when this was both primarily a historical project and a team comprised of varied doctoral holders from disciplines such as English literature, psychology, computer science, biology, and mathematics?12 Culturomics project leaders Erez Lieberman Aiden and Jean-Baptiste Michel responded frankly. While the team approached historians, and some did play advisory roles, every author listed "directly contributed to either the creation of the collection of written texts (the 'corpus') or to the design and execution of the specific analyses we performed. No academic historians met this bar." They continued to indict the profession:

The historians who came to the meeting were intelligent, kind, and encouraging. But they didn't seem to have a good sense of how to wield quantitative data to answer questions, didn't have relevant computational skills, and didn't seem to have the time to dedicate to a big multi-author collaboration. It's not their fault: these things don't appear to be taught or encouraged in history departments right now.13

This is a serious indictment, considering the pending archival shift to digital sources, and one that still largely holds true (a few history departments now offer digital history courses, but they are still outliers). This form of history, the massive analysis of digitized information, will continue. Furthermore, these approaches will attract public attention, as they lend themselves well to accessible online deployment. Unless historians are well positioned, they risk ceding professional ground to other disciplines.

While historians were not involved in Culturomics, they are, however, involved in several other digital projects. Two major and ambitious long-term Canadian projects are currently using computational methods to reconstruct the demographic past. The first is the Canadian Century Research Infrastructure project, which is creating databases from five censuses and aiming to provide "a new foundation for the study of social, economic, cultural, and political change," while the second is the Programme de recherche en démographie historique, hosted by the Université de Montréal.13 They are reconstructing the European population of Quebec in the seventeenth and eighteenth centuries, drawing heavily on parish registers. Furthermore, several projects in England have harnessed the wave of interest in "big data" to tell compelling historical stories about their pasts: the Criminal Intent project, for example, used the proceedings of the Old Bailey courts (the main criminal court in Greater London) to visualize large quantities of historical information.14 The Criminal Intent project was itself funded by the Digging into Data Challenge, which brings together multinational research teams and funding from federal-level funding agencies in Canada, the United States, the United Kingdom, and beyond. Several other historical studies have been funded under their auspices, from studies of commodity trade in the eighteenth- and nineteenth-century Atlantic world (involving scholars at the University of Edinburgh, the University of St Andrews, and York University), to data mining the degrees of economic opportunity and spatial mobility in Britain, Canada, and the United States (bringing researchers at the University of Guelph, the University of Leicester, and the University of Minnesota together).15 These exciting interdisciplinary teams may be heralding a disciplinary shift, although it is still early days.

Collaborative teams have limitations, though. We will have to be prepared to tackle digital sources alone, on shoestring budgets. Historians will not always be able to call on external programmers or specialists to do work, because teams often develop out of funding grants. As governmental and university budgets often prove unable to meet consistent research needs, baseline technical knowledge is required for independent and unfunded work. If historians rely exclusively on outside help to deal with born-digital sources, the ability to do computational history diminishes. As the success of the Programming Historian indicates, and as I will demonstrate in my case study below, historians can learn basic techniques.16 Let us tweak our own algorithms.

A social or cultural history of the very late-twentieth or twenty-first century will have to account for born-digital sources, and historians will (in conjunction with archival professionals) have to curate and make sense of the data. Historians need to be able to engage with these sources themselves, or at least as part of a team, rather than having sources mediated through the lens of outside disciplines or commercial interests. Algorithms may be one meaningful way to make sense of cultural trends or extract meaningful information from billions of tweets, but we must ensure professional historical input goes into their crafting.

The Third Wave of Computational History

Historians have fruitfully used computers in the past, as part of two previous waves. The 1960s and 1970s saw computational tools used for demographic, population, and economic histories, a trend that in Canada saw full realization in comprehensive studies such as Michael Katz's The People of Hamilton, Canada West.17 As Ian Anderson has convincingly noted in a review of history and computing, studies of this sort saw the field become associated with quantitative studies.18

After a retreat from the spotlight, computational history rose again in the 1990s with the advent of the personal computer, graphical user interfaces, and overall improvements in ease of use. Punch cards could now give way to database programs, GIS, and early online networks such as H-Net and Usenet. We are now, I believe, on the cusp of a third revolution in computational history, thanks to three main factors: decreasing storage costs, the power of the Internet and distributed cloud computing, and the rise of professionals dealing with both digital preservation and open-source tools.

Decreasing storage costs have led to massive amounts of information being preserved, as Gleick noted in this article's epigraph. It is important to note that this information overload is not new: people have long worried about the impact of too much information.19 In the sixteenth century, responding to the rise of the printing press, the German priest Martin Luther decried that the "multitude of books [were] a great evil." In the nineteenth century, Edgar Allan Poe bemoaned that "[t]he enormous multiplication of books in every branch of knowledge is one of the greatest evils of this age." As recently as 1970, American historian Lewis Mumford lamented: "the overproduction of books will bring about a state of intellectual enervation and depletion hardly to be distinguished from massive ignorance."20 The rise of born-digital sources must thus be seen in this continuous context of handwringing around the expansion and rise of information.

Useful historical information is being preserved at a rate that is accelerating with every passing day. One useful metric to compare dataset size is that of the American Library of Congress (LOC); indeed, a "LOC" has become shorthand for a unit of measurement in the digital humanities. Measuring the actual extent of the printed collection in terms of data is notoriously difficult: James Gleick has claimed it to be around 10TB,22 whereas the Library of Congress itself claims a more accurate figure is about 200TB.23 While these two figures are obviously divergent, neither is on an unimaginable scale: personal computers now often ship with a terabyte or more of storage. When Claude Shannon, the father of information theory, carried out a thought experiment in 1949 about items or places that might store "information" (an emerging field which viewed messages in their aggregate), he scaled sites from a single digit upwards to the largest collection he could imagine. The list scaled up logarithmically from 10^0 to 10^13. If a single-spaced page of typing is 10^4, 10^7 the Proceedings of the Institute of Radio Engineers, and 10^9 the entire text of the Encyclopaedia Britannica, then 10^14 was "the largest information stockpile he could think of: the Library of Congress."24

But now the Library of Congress' print collection, something that has taken two centuries to gather, is dwarfed by its newest collection of archived born-digital sources, the vast majority only a matter of years old compared to the much wider date-range of non-digital sources in the traditional collection. The LOC has begun collecting its own born-digital Internet archive: multiple snapshots are taken of webpages to show how they change over time. Even as a selective, curated collection drawing on governmental, political, educational, and creative websites, the LOC has already collected 254TB of data, and adds 5TB a month.25 The Internet Archive, through its Wayback Machine, is even more ambitious: it seeks to archive every website on the Internet. While its size is also hard to put an exact finger on, as of late 2012 it had over 10 petabytes of information (or, to put it into perspective, a little over 50 Libraries of Congress, if we take the 200TB figure).26


This is to some degree comparing apples and oranges: the Library of Congress is predominantly print, whereas the Internet Archive has considerable multimedia holdings, which include videos, images, and audio files. The LOC is furthermore a more curated collection, whereas the Internet Archive draws from a wider range of producers. The Internet offers the advantages and disadvantages of being a more democratic archive. For example, a GeoCities site (an early way for anybody to create their own website) created by a 12-year-old Canadian in the mid-1990s might be preserved by the Internet Archive, whereas it almost certainly would not be saved by a national archive. This difference in size and collections management, however, is at the root of the changing historians' toolkit. If we make use of these large data troves, we can access a new and broad range of historical subjects.

Conventional sources are also part of this deluge. Google Books has been digitizing the printed word, and it currently has 15 million works completed; by 2020, it audaciously aims to digitize every book ever written. While copyright issues loom, the project will allow researchers to access aggregate word and phrase data, as seen in the Google n-gram project.27 In Canada, Library and Archives Canada (LAC) has a more circumscribed digitization project, collecting "a representative sample of Canadian websites," focusing particularly on Government of Canada webpages, as well as a curated approach preserving case studies such as the 2006 federal election. Currently it has 4TB of federal government information, and 7TB of total information in its larger web archive (a figure which, as I note below, is unlikely to drastically improve, despite ostensible goals of archival 'modernization').28

In Canada, LAC has, on the surface, been advancing an ambitious digitization program, as it realizes the challenges now facing archivists and historians. Daniel J. Caron, the Librarian and Archivist of Canada, has been outspoken on this front, with several public addresses on the question of digital sources and archives. Throughout 2012, primary source digitization at LAC preoccupied critics, who saw it as a means to depreciate on-site access. The Canadian Association of University Teachers and the Canadian Historical Association, as well as other professional organizations, mounted campaigns against this shift in resources.29 LAC does have a point, however, with its modernization agenda. The state of Canada's online collections is small and sorely lacking when compared to their on-site collections, and LAC does need to modernize and achieve the laudable goal of reaching audiences beyond Ottawa.

Nonetheless, digitization has unfortunately been used as a smokescreen to depreciate overall service offerings.30 The modernization agenda would also see 50 percent of the digitization staff cut.31 LAC's online collections are small; it does not work with online developers through Application Programming Interfaces (APIs, a way for a computer program to talk to another computer and speed up research; Canadiana.ca is incidentally a leader in this area); and there have been no promising indications that this state of affairs will change. Digitization has not been comprehensive. Online preservation is important, and historians must fight for both on-site and online access.

A number of historians have recognized this challenge and are playing an instrumental role in preserving the born-digital past. The Roy Rosenzweig Center for History and New Media at George Mason University launched several high-profile preservation projects: the September 11 Digital Archive, the Hurricane Digital Memory Bank (focusing on Hurricanes Katrina and Rita), and, as of writing, the Occupy Archive.32 Massive amounts of online content are curated and preserved: photographs, news reports, blog posts, and now tweets. These complement more traditional efforts of collecting and preserving oral histories and personal recollections, which are then geo-tagged (tied to a particular geographical location), transcribed, and placed online. These archives serve a dual purpose: preserving the past for future generations, while also facilitating easy dissemination for today's audiences.

Digital preservation, however, is a topic of critical importance. Much of our early digital heritage has been lost. Many websites, especially those predating the Internet Archive's 1996 web archiving project launch, are completely gone. Early storage media (from tape to early floppy disks) have become obsolete and are now nearly inaccessible. Compounding this, software development has seen early proprietary file formats depreciated, ignored, and eventually made wholly inaccessible.33 Fortunately, the problem of digital preservation was recognized by the 1990s, and in 2000 the American government established the National Digital Information Infrastructure and Preservation Program.34 As early as 1998, digital preservationists argued: "[h]istorians will look back on this era … and see a period of very little information. A 'digital gap' will span from the beginning of the wide-spread use of the computer until the time we eventually solve this problem. What we're all trying to do is to shorten that gap." To bear home their successes, that quotation is taken from an Internet Archive snapshot of Wired magazine's website.35

Indeed, greater attention towards rigorous metadata (which is, briefly put, data about data, often preserved in plain text format), open-access file formats documented by the International Organization for Standardization (ISO), and dedicated digital preservation librarians and archivists gives us hope for the future.36 It is my belief that cloud storage, the online hosting of data in large third-party offsite centres, mitigates the issue of obsolete storage media, and that the move towards open-source file formats, even amongst large commercial enterprises such as Microsoft, provides greater documentation and uniformity.


A Brief Case Study: Accessing and Navigating Canada's Digital Collections, with an Emphasis on Tools for Born-Digital Sources

What can we do about this digital deluge? There are no simple answers, but historians must begin to conceptualize new additions to their traditional research and pedagogical toolkits. In the section that follows, I will do two things: introduce a case study, and several tools, both proprietary and those that are open and available off-the-shelf, to reveal one way forward.

For a case study, I have elected to use Canada's Digital Collections (CDC), a collection of 'dead' websites removed from their original location on the World Wide Web and now digitally archived at LAC. In this section, I will introduce the collection, using digitally-archived sources from the early World Wide Web, and discuss my research methodology, before illustrating how we can derive meaningful data from massive arrays of unstructured data.37

This is a case study of a web archive. Several of these methodologies could be applied to other formats, including large arrays of textual material, whether born-digital or not. A distant reading of this type is not medium specific. It does, however, assume that the text is high quality. Digitized primary documents do not always have text layers, and if they are instituted computationally through Optical Character Recognition, errors can appear. Accuracy rates for digitized textual documents can range from as low as 40 percent for seventeenth- and eighteenth-century documents, to 80–90 percent for modern microfilmed newspapers, to over 99 percent for typeset, word-processed documents.38 That said, while historians today may be more interested in applying distant reading methodologies to more traditional primary source materials, it is my belief that in the not-so-distant future websites will be an increasingly significant source for social, cultural, and political historians.

What are Canada's Digital Collections? In 1996, Industry Canada, funded by the federal Youth Employment Strategy, facilitated the development of Canadian heritage websites by youth (defined as anyone between the ages of 15 and 30). It issued contracts to private corporations that agreed to use these individuals as web developers, for the twofold purpose of developing digital skills and instilling respect for, and engagement with, Canadian heritage and preservation.39 The collection, initially called SchoolNet but rebranded as Canada's Digital Collections in 1999, was accessible from the CDC homepage. After 2004, the program was shut down, and the vast majority of the websites were archived on LAC's webpage. This collection is now the largest born-digital Internet archive that LAC holds, worthy of study as content, as well as for thinking about methodological approaches.

Conventional records about the program are unavailable in the traditional LAC collection, unsurprising in light of both its recent creation and backlogs in processing and accessioning archival materials. We can use one element of our born-digital toolkit, the Internet Archive's WaybackMachine (its archive of preserved early webpages), to get a sense of what this project looked like in its heyday. The main webpage for the CDC was www.collections.ic.gc.ca. However, a visitor today will see nothing but a collection of broken image links (two of them will bring you to the English and French-language versions of the LAC-hosted web archive, although this is only discernible through trial and error, or through the site's source code).

With the WaybackMachine, we have a powerful, albeit limited, opportunity to "go back in time" and experience the website as previous web viewers would have (if a preserved website is available, of course). Among the over 420 million archived websites, we can search for and find the original CDC site. Between 12 October 1999 and the most recent snapshot, or web crawl, of the now-broken site on 6 July 2011, the site was crawled 410 times.40 Unfortunately, every site version until the 15 October 2000 crawl has been somehow corrupted: content has been replaced by Chinese characters, which translate only into place names and other random words. Issues of early digital preservation, noted above, have reared their head in this collection.

From late 2000 onward, however, fully preserved CDC sites are accessible. Featured sites, new entries, success stories, awards, curricular units, and prominent invitations to apply for project funding populate the brightly coloured website. The excitement is palpable. John Manley, then Industry Minister, declared: "it is crucial that Canada's young people get the skills and experience they need to participate in the knowledge-based economy of the future," and that CDC "shows that young people facing significant employment challenges can contribute to, and learn from, the resources available on the Information Highway."41 The documentation, reflecting the lower level of technical expertise and scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as well as preservation issues, hardware requirements for computers (a computer with an Intel 386 processor, 8MB RAM, a 1GB hard drive, and a 28.8kbps modem was recommended), and training information for employers, educators, and participants.42 Sample proposals and community liaison ideas, amidst other information, set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites offers an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how viewers discovered the site and whether they think it is a good educational resource (they must have been relieved that the majority thought so: in 2003, some 84 percent felt that it was a "good" resource, the best of its type).43 By 2004, the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, alphabetical listing, success stories, and detailed copyright information.

The Canada's Digital Collections site began to move towards removal in November 2006, and was initially completely gone. A visitor to the site would learn that the content "is no longer available," and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007, a notice appeared with a declaration that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained then, and today for that matter, is a straightforward alphabetical list, and a disclaimer that noted: "[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost."

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use to navigate large amounts of data. First, I create my own tools through computer programming; this work is showcased in the following section. I program in Wolfram Research's Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base, which complements exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website in perfectly readable plaintext (avoiding Unicode errors), as follows: input = Import["http://activehistory.ca/", "Plaintext"], and a second can make it all lower case for textual analysis, as such: lowerpage = ToLowerCase[input]. Variables are a critical building block. In the former example, "input" now holds the entire plaintext of the website ActiveHistory.ca, and "lowerpage" then holds it in lower case. Once we have lower-cased information, we are quickly able to extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
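A minimal sketch of this workflow, combining the two commands above with a frequency tally (the URL is simply the example used in the text, and the tokenization rule is my own illustrative choice):

    (* import a page as plaintext and case-fold it *)
    input = Import["http://activehistory.ca/", "Plaintext"];
    lowerpage = ToLowerCase[input];

    (* tokenize on runs of non-word characters, tally, and sort by frequency *)
    words = StringSplit[lowerpage, Except[WordCharacter] ..];
    Take[Reverse[SortBy[Tally[words], Last]], 20]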

Even more promising is Mathematica's integration with the Wolfram|Alpha database (itself accessible at wolframalpha.com), a computational knowledge engine that promises to put incredible amounts of information at a user's fingertips. With a quick command, one can access the temperature at Toronto's Lester B. Pearson International Airport at 5:00 pm on 18 December 1983, the relative popularity of names such as "Edith" or "Emma," the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information, or find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.
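From within Mathematica, such lookups are single calls to the built-in WolframAlpha function (which requires an Internet connection); a minimal sketch, where the query phrasings are illustrative and what comes back depends on the engine's holdings:

    (* send free-form queries to the Wolfram|Alpha computational knowledge engine *)
    WolframAlpha["Canadian unemployment rate March 1985"]
    WolframAlpha["popularity of the name Emma"]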

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be-all and end-all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools, a powerful suite of textual analysis tools.47 Uploading one's own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a "word cloud" of a corpus; the ability to click on any word to see its relative rise and fall over time (word trends); access to the original text; and the ability to click on any word to see its contextual placement within individual sentences. It ably facilitates the "distant" reading of any reasonably-sized body of textual information, and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. There are many visualization options provided: you can plot relationships between data, compare values (i.e. bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or, in a best-case scenario, hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" that appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era, and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2, or on my own blog at ianmilligan.ca. I have elected to leave them online rather than provide commands in this article: they require some technical knowledge, and working directly from the web is quicker than from the printed page. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian, dealing with conventional sources, often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command-line entry brought the entire archive onto my system; it took days to process the information, and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay, or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations: the information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB; 360 MB of plaintext with HTML encoding, reduced to a 'mere' 50MB if all encoding is removed. A 50MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the Internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with the material. Before you do this, though, you need to be familiar with the website's "terms of service," the legal documentation that outlines how users are allowed to use a website, its resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced in part or in whole and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so: the index is located at www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/ABbenevolat, the next in …cdc/abnature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system, in this case one noting that you had seen the disclaimer about the archival nature of the website), and then used a one-line command in my command line to download the entire database.52 Due to the imposition of a random wait and bandwidth limit, which meant that I would download one file every half to one and a half seconds, at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these connections must often be inferred. Online, HTML born-digital sources are different, in that the link offers a very concrete file.


Visualizing the CDC Collection as a Series of Hyperlinks

One quick visualization, the circular cluster seen in the accompanying figure, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As the figure shows, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The crawl is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and the relationships between different sources. At a glimpse, we can see suggestions to follow other sites, and determine whether any nodes are shared amongst the various websites revealed.
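A minimal sketch of how such a link graph can be assembled, assuming a list of archived page URLs to crawl (the two URLs here are placeholders):

    (* "Hyperlinks" is a built-in Import element returning a page's outbound links *)
    pages = {"http://www.example.org/siteA/", "http://www.example.org/siteB/"};
    edges = Flatten[Table[Thread[p -> Import[p, "Hyperlinks"]], {p, pages}]];

    (* plot the link structure; tooltips reveal each URL on mouse-over *)
    GraphPlot[edges, VertexLabeling -> Tooltip]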

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian who is primarily interested in the written word, my first step was to isolate the plaintext throughout the material, so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels: first, the aggregate collection; and second, dividing the collection into each 'file,' or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes, and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words within the nearly eight million words of the collection appear? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader, and is of minimal visual attractiveness. Two quick decisions were made to enhance data usability. First, the information was "case-folded," all sent to lower-case; this enables us to count the total appearances of a word such as "digital," regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency (as opposed to phrase frequency), "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is comprised entirely of stop words), but will quickly dominate any word-level visualization or list.53 Other words may need to be added to, or removed from, stop lists depending on the corpus: for example, if the phrase "page number" appeared on every page, that would be useless information for the researcher.
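A minimal sketch of such filtering, assuming "words" holds a case-folded word list as generated earlier (the ten-word stop list is illustrative only; real lists run to hundreds of entries):

    (* a tiny illustrative stop list *)
    stopwords = {"the", "and", "to", "of", "a", "in", "is", "for", "on", "that"};

    (* drop stop words before tallying the remainder *)
    kept = Select[words, ! MemberQ[stopwords, #] &];
    Take[Reverse[SortBy[Tally[kept], Last]], 20]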

The first step, then, is to use Mathematica to format word frequency information into Wordle.net-readable format (if we put all of the text into that website, it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds. What can we learn from this illustration?


A Word Cloud of All Words in the CDC Corpus

First, without knowing anything about the sites, we can immediately learn that the collection deals with Canada and Canadians, that it holds documents in both French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, a word cloud leaves a severe amount of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps, and have their place, but need to be used with extreme caution. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present." Despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters (such as "bi"), two syllables, or, pertinent in our case, two words; an example: "canada is." A trigram is three words, a quadgram four words, a fivegram five words, and so on; a sketch of how these can be generated follows.
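Word-level n-grams are straightforward to produce once a text has been tokenized; a minimal sketch, again assuming "words" holds a case-folded word list (the helper name "ngrams" is my own):

    (* Partition with offset 1 slides a window of length n across the list *)
    ngrams[wordlist_, n_] := Partition[wordlist, n, 1];

    (* the fifteen most frequent trigrams in a file *)
    Take[Reverse[SortBy[Tally[ngrams[words, 3]], Last]], 15]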


Word Cloud of the File "Past and Present"

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

"alberta" 758, "listen" 490, "settlers" 467, "new" 454, "canada" 388, "land" 362, "read" 328, "people" 295, "settlement" 276, "chinese" 268, "canadian" 264, "place" 255, "heritage" 254, "edmonton" 250, "years" 238, "west" 236, "raymond" 218, "ukrainian" 216, "came" 213, "calgary" 209, "names" 206, "ranch" 204, "updated" 193, "history" 190, "communities" 183 …

Rather than having to visit the website, we can see, almost immediately and at a glance, what the site is about. More topics become apparent when one looks at trigrams, as in this automatically generated output of top phrases:

"first people settlers" 100, "the united states" 72, "one of the" 59, "of the west" 58, "settlers last updated" 56, "people settlers last" 56, "albertans first people" 56, "adventurous albertans first" 56, "the bar u" 55, "opening of the" 54, "communities adventurous albertans" 54, "new communities adventurous" 54, "bar u ranch" 53, "listen to the" 52 …

Through a few clicks, we can navigate and determine, from a distance, the basic contours of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or, second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables one to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stop-word" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

"experience in the" 919, "transferred to the" 918, "funded by the" 918, "of the provincial" 915, "this web site" 915, "409 registry 2" 914, "are no longer" 914, "gouvernement du canada" 910, "use this link" 887, "page or you" 887, "following page or" 887, "automatically transferred to" 887, "be automatically transferred" 887, "will be automatically" 887, "industry you will" 887, "multimedia industry you" 887, "practical experience in" 887, "gained practical experience" 887, "they gained practical" 887, "while they gained" 887, "canadians while they" 887, "young canadians while" 887, "showcase of work" 887, "excellent showcase of" 887 …

Here we see an array of common language: funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we can move into the unique or very rare phrases that appear only once, twice, or three times.
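The two counting strategies described above are simple to implement. What follows is a minimal sketch in Python; it is not the routine used in this study (which was written in Mathematica), and the sites/ directory of plain-text website dumps is an illustrative assumption:

    import os
    import re
    from collections import Counter

    def tokenize(text):
        # Lowercase and split into simple word tokens
        return re.findall(r"[a-z']+", text.lower())

    aggregate = Counter()      # total appearances across the corpus (n = 457)
    document_freq = Counter()  # number of sites containing a term (n = 45/384)
    num_sites = 0

    for name in os.listdir("sites"):
        with open(os.path.join("sites", name), encoding="utf-8") as f:
            words = tokenize(f.read())
        num_sites += 1
        aggregate.update(words)           # aggregate appearances
        document_freq.update(set(words))  # at most one count per site

    print(aggregate["aboriginal"])                  # aggregate count
    print(document_freq["aboriginal"] / num_sites)  # relative appearances

The same counters extend naturally from single words to bigrams, trigrams, and quadgrams by counting tuples of adjacent tokens instead of individual tokens.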

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time. … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.

At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.
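For readers who wish to experiment without the SEASR/Meandre workbench, the same family of algorithms is available in several libraries. Below is a minimal sketch using the Python gensim library's LDA implementation as a stand-in; this is not the workflow used in this case study, and the sample documents are placeholders:

    from gensim import corpora
    from gensim.models import LdaModel

    # One tokenized document per archived website (placeholders here)
    texts = [["alberta", "towns", "bridges", "lakes"],
             ["story", "knight", "archives", "wife"]]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    # The number of topics is a researcher's judgment call
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
    for topic in lda.print_topics(num_words=5):
        print(topic)

MALLET's own command-line interface produces comparable output: a list of topic keys and per-document topic weights.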

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works. It is a proprietary black box: while the original algorithm is available, the actual methods that go into determining the ranking of search results are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.

A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
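A minimal Python sketch of the first prong follows. Again, the original was implemented in Mathematica; the frequencies/ directory and its one-file-per-site "word count" format are assumptions for illustration:

    import os
    import matplotlib.pyplot as plt

    terms = ["freedom", "black", "liberty"]
    hits = {}  # site file name -> total occurrences of the search terms

    for name in sorted(os.listdir("frequencies")):
        counts = {}
        with open(os.path.join("frequencies", name), encoding="utf-8") as f:
            for line in f:               # assumed format: "word count"
                parts = line.split()
                if len(parts) == 2:
                    counts[parts[0]] = int(parts[1])
        total = sum(counts.get(t, 0) for t in terms)
        if total:
            hits[name] = total

    # Chart where the search terms cluster in the collection
    plt.bar(range(len(hits)), list(hits.values()))
    plt.xticks(range(len(hits)), list(hits.keys()), rotation=90)
    plt.ylabel("occurrences of search terms")
    plt.tight_layout()
    plt.show()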

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.

[Figure: A snapshot of the dynamic finding aid tool]

We see that this website focuses mostly on southwestern Ontario, where many oil fields are located. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly identify critical sources.
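A keyword-in-context routine is straightforward to reproduce. Here is a minimal Python sketch standing in for the Mathematica original; the file name is illustrative:

    import re

    def kwic(path, keyword, window=5):
        # Print each occurrence of keyword with `window` words of context
        with open(path, encoding="utf-8") as f:
            words = re.findall(r"\S+", f.read())
        for i, w in enumerate(words):
            if w.lower().strip('.,;:"()') == keyword:
                left = " ".join(words[max(0, i - window):i])
                right = " ".join(words[i + 1:i + 1 + window])
                print(f"{left} [{w}] {right}")

    kwic("blackgold.txt", "ontario")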

Another model through which we can visualize documents is through a viewer primarily focusing on term document frequency and inverse document frequency, discussed below. To prepare this material, we combine our earlier data about word frequency (that the word "experience" appears 919 times in a given site, perhaps) with aggregate information about frequency in the entire corpus (i.e., the word "experience" appears 1,766 times overall). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf multiplied by the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."
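In code, the weighting reads directly off the formula: for term t in document d, tf-idf = tf(t, d) × log(N / df(t)). A minimal Python sketch, reusing the kinds of counters built in the frequency sketch earlier (variable names are illustrative):

    import math

    def tf_idf(term, site_counts, document_freq, num_sites):
        # site_counts: term frequencies for one website (tf)
        # document_freq: number of sites in which each term appears (df)
        tf = site_counts[term]
        df = document_freq[term]
        if df == 0:
            return 0.0
        return tf * math.log(num_sites / df)

Sorting a site's vocabulary by this score surfaces the words that make it distinctive, which is what the "inverse document frequency" view displays.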

At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance, rather than processed on-site, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need rears itself, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin or jumpstart a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. This should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources, and of attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information; this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web, whether through blogging, micro-blogging, publishing webpages, and so forth; it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history and the readiness to work with born-digital sources cannot stem out of pedagogy alone, however; institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary, and second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996: Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.

14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients – 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to teach the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp Greater Toronto Area, Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook, Penguin, 2010. For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online: www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes, a megabyte (MB) a million, a gigabyte (GB) a billion, a terabyte (TB) a trillion, and a petabyte (PB) a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can, of course, vary depending on the compression methods chosen. Their math was based on an average of 8MB per book, if it was scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82; published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Centre for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Centre for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: Temple University Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive: www.webarchive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians, but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist/magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta, www.collections.ic.gc.ca," Internet Archive, available online: www.waybackarchive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online: www.webarchive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collection Webpage, 24 November 2000, through Internet Archive: www.webarchive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections homepage," 21 November 2003, through Internet Archive: www.webarchive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important news about Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive: www.webarchive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 pm on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and now "Emma" is the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.

48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 "Important Notices," Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. These can usually be found on index or splash pages, under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated late-2007 MacBook Pro.

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.

poverty and other endemic structural features.8 While his methodology does not apply itself well to rapidly changing societies (his focus was on long-term, slow change), his points around distant reading and the need for collaborative, interdisciplinary research are a useful antecedent. Similar opportunities present themselves when dealing with the vast array of born-digital sources (as well as digitized historical sources): we can distantly read large arrays of sources, from a longue durée of time and space, mediated through algorithms and computer interfaces, opening up the ability to consider the voices of everybody preserved in the source.

Historians must be open to the digital turn, due to the astounding growth of digital sources and an increasing technical ability to process them on a mass scale. Social sciences and humanities researchers have begun to witness a profound transformation in how they conduct and disseminate their work. Indeed, historical digital sources have reached a scale where they defy conventional analysis and now require computational analysis. Responding to this, archives are increasingly committed to preserving cultural heritage materials in digital forms.

I would like to pre-emptively respond to two criticisms of this digital shift within the historical profession. First, historians will not all have to become programmers. Just as not all historians need a firm grasp of Geographical Information Systems (GIS), a developed understanding of the methodological implications of community-based oral history, the incorporation of a transnational perspective, or in-depth engagement with cutting-edge demographic models, not all historians have to approach their trade from a computational perspective. Nor should they. Digital history does not replace close reading, traditional archival inquiry, or going into communities (to use only a few examples) to uncover notions of collective memory or trauma. Indeed, digital historians will play a facilitative role and provide a broader reading context. Yet they will still be historians: collecting relevant sources, analyzing and contextualizing them, situating them in convincing narratives or explanatory frameworks, and disseminating their findings to wider audiences.

Neither will historians replace professional programmers. Instead, as with other subfields, historians using digital sources will need to be prepared to work on larger, truly interdisciplinary teams. Computer scientists, statisticians, and mathematicians, for example, have critical methodological and analytical contributions to make towards questions of historical inquiry. Historians need to forge further links with these disciplines. Such meaningful collaboration, however, will require historians who can speak their languages and who have an understanding of their work. Some of us must be able to discuss computational or algorithmic methodologies when we collectively explore the past.

The methodologies discussed in this paper fall into the twin fields of data mining and textual analysis. The former refers to the discovery of patterns or information about large amounts of information, while the latter refers to the use of digital tools to analyze large quantities of text.9 The case study here draws upon three main concepts: information retrieval (quickly finding what you want), information and term frequency (how often various concepts and terms appear), and topic modeling (finding frequently occurring bags of words throughout documents). Other projects have involved authorship attribution (finding out the author of documents based on previous patterns), style analysis, or creating network graphs of topics, groups, documents, or individuals.

The Culturomics project, run by the Cultural Observatory at Harvard University, is a notable model of collaboration on a big history project. As laid out in Science, and put online as the Google Books n-gram viewer (www.books.google.com/ngrams), this project indexed word and phrase frequency across over five million books, enabling researchers to trace the rise and fall of cultural ideas and phenomena through targeted keyword and phrase searches.10 This is the most ambitious, and certainly the most advanced and accessible, Big History project yet launched. Yet amongst the extensive author list (13 individuals and the Google Books team), there were no historians. This suggests that other disciplines in the humanities and social sciences have embraced the digital turn far more than historians. However, as I mentioned, in the coming decades historians will increasingly rely on digital archives. There is a risk that history, as a professional discipline, could be left behind.

American Historical Association President Anthony Grafton noted this absence in a widely circulated Perspectives article. Why was there no historian, he asked, when this was both primarily a historical project and a team comprised of varied doctoral holders from disciplines such as English literature, psychology, computer science, biology, and mathematics?11 Culturomics project leaders Erez Leiberman Aiden and Jean-Baptiste Michel responded frankly. While the team approached historians, and some did play advisory roles, every author listed "directly contributed to either the creation of the collection of written texts (the 'corpus') or to the design and execution of the specific analyses we performed. No academic historians met this bar." They continued, to indict the profession:

The historians who came to the meeting were intelligent, kind, and encouraging. But they didn't seem to have a good sense of how to wield quantitative data to answer questions, didn't have relevant computational skills, and didn't seem to have the time to dedicate to a big multi-author collaboration. It's not their fault: these things don't appear to be taught or encouraged in history departments right now.12

This is a serious indictment, considering the pending archival shift to digital sources, and one that still largely holds true (a few history departments now offer digital history courses, but they are still outliers). This form of history, the massive analysis of digitized information, will continue. Furthermore, these approaches will attract public attention, as they lend themselves well to accessible online deployment. Unless historians are well positioned, they risk ceding professional ground to other disciplines.

While historians were not involved in Culturomics, they are, however, involved in several other digital projects. Two major and ambitious long-term Canadian projects are currently using computational methods to reconstruct the demographic past. The first is the Canadian Century Research Infrastructure project, which is creating databases from five censuses and aiming to provide "a new foundation for the study of social, economic, cultural, and political change," while the second is the Programme de recherche en démographie historique, hosted by the Université de Montréal.13 They are reconstructing the European population of Quebec in the seventeenth and eighteenth centuries, drawing heavily on parish registers. Furthermore, several projects in England have harnessed the wave of interest in "big data" to tell compelling historical stories about their pasts: the Criminal Intent project, for example, used the proceedings of the Old Bailey courts (the main criminal court in Greater London) to visualize large quantities of historical information.14 The Criminal Intent project was itself funded by the Digging into Data Challenge, which brings together multinational research teams and funding from federal-level funding agencies in Canada, the United States, the United Kingdom, and beyond. Several other historical studies have been funded under their auspices, from studies of commodity trade in the eighteenth- and nineteenth-century Atlantic world (involving scholars at the University of Edinburgh, the University of St Andrews, and York University) to data mining the degrees of economic opportunity and spatial mobility in Britain, Canada, and the United States (bringing researchers at the University of Guelph, the University of Leicester, and the University of Minnesota together).15

These exciting interdisciplinary teams may be heralding a disciplinary shift, although it is still early days.

Collaborative teams have limitations, though. We will have to be prepared to tackle digital sources alone, on shoestring budgets. Historians will not always be able to call on external programmers or specialists to do work, because teams often develop out of funding grants. As governmental and university budgets often prove unable to meet consistent research needs, baseline technical knowledge is required for independent and unfunded work. If historians rely exclusively on outside help to deal with born-digital sources, the ability to do computational history diminishes. As the success of the Programming Historian indicates, and as I will demonstrate in my case study below, historians can learn basic techniques.16 Let us tweak our own algorithms.

A social or cultural history of the very late-twentieth or twenty-first century will have to account for born-digital sources, and historians will (in conjunction with archival professionals) have to curate and make sense of the data. Historians need to be able to engage with these sources themselves, or at least as part of a team, rather than having sources mediated through the lens of outside disciplines or commercial interests. Algorithms may be one meaningful way to make sense of cultural trends or extract meaningful information from billions of tweets, but we must ensure professional historical input goes into their crafting.

The Third Wave of Computational History

Historians have fruitfully used computers in the past, as part of two previous waves. The 1960s and 1970s saw computational tools used for demographic, population, and economic histories, a trend that in Canada saw full realization in comprehensive studies such as Michael Katz's The People of Hamilton, Canada West.17 As Ian Anderson has convincingly noted in a review of history and computing, studies of this sort saw the field become associated with quantitative studies.18

After a retreat from the spotlight, computational history rose again in the 1990s with the advent of the personal computer, graphical user interfaces, and overall improvements in ease of use. Punch cards could now give way to database programs, GIS, and early online networks such as H-Net and Usenet. We are now, I believe, on the cusp of a third revolution in computational history, thanks to three main factors: decreasing storage costs, the power of the Internet and distributed cloud computing, and the rise of professionals dealing with both digital preservation and open-source tools.

Decreasing storage costs have led to massive amounts of information being preserved, as Gleick noted in this article's epigraph. It is important to note that this information overload is not new. People have long worried about the impact of too much information.19 In the sixteenth century, responding to the rise of the printing press, the German priest Martin Luther decried that the "multitude of books [were] a great evil." In the nineteenth century, Edgar Allan Poe bemoaned that "[t]he enormous multiplication of books in every branch of knowledge is one of the greatest evils of this age." As recently as 1970, American historian Lewis Mumford lamented that "the overproduction of books will bring about a state of intellectual enervation and depletion hardly to be distinguished from massive ignorance."20 The rise of born-digital sources must thus be seen in this continuous context of handwringing around the expansion and rise of information.

Useful historical information is being preserved at a rate that is accelerating with every passing day. One useful metric to compare dataset size is that of the American Library of Congress (LOC). Indeed, a "LOC" has become shorthand for a unit of measurement in the digital humanities.21 Measuring the actual extent of the printed collection in terms of data is notoriously difficult: James Gleick has claimed it to be around 10TB,22 whereas the Library of Congress itself claims a more accurate figure is about 200TB.23 While these two figures are obviously divergent, neither is on an unimaginable scale: personal computers now often ship with a terabyte or more of storage. When Claude Shannon, the father of information theory, carried out a thought experiment in 1949 about items or places that might store "information" (an emerging field, which viewed messages in their aggregate), he scaled sites from a single digit upwards to the largest collection he could imagine. The list scaled up logarithmically from 10^0. If a single-spaced page of typing is 10^4, 10^7 the Proceedings of the Institute of Radio Engineers, and 10^9 the entire text of the Encyclopaedia Brittanica, then 10^14 was "the largest information stockpile he could think of: the Library of Congress."24

But now the Library of Congress' print collection, something that has taken two centuries to gather, is dwarfed by their newest collection of archived born-digital sources, the vast majority only a matter of years old, compared to the much wider date-range of non-digital sources in the traditional collection. The LOC has begun collecting its own born-digital Internet archive: multiple snapshots are taken of webpages to show how they change over time. Even as a selective, curated collection, drawing on governmental, political, educational, and creative websites, the LOC has already collected 254TB of data, and adds 5TB a month.25 The Internet Archive, through its Wayback Machine, is even more ambitious: it seeks to archive every website on the Internet. While its size is also hard to put an exact finger on, as of late 2012 it had over 10 petabytes of information (or, to put it into perspective, a little over 50 Libraries of Congress, if we take the 200TB figure).26

This is to some degree comparing apples and oranges: the Library of Congress is predominantly print, whereas the Internet Archive has considerable multimedia holdings, which include videos, images, and audio files. The LOC is furthermore a more curated collection, whereas the Internet Archive draws from a wider range of producers. The Internet offers the advantages and disadvantages of being a more democratic archive. For example, a GeoCities site (an early way for anybody to create their own website) created by a 12-year-old Canadian teenager in the mid-1990s might be preserved by the Internet Archive, whereas it almost certainly would not be saved by a national archive. This difference in size and collections management, however, is at the root of the changing historians' toolkit. If we make use of these large data troves, we can access a new and broad range of historical subjects.

Conventional sources are also part of this deluge. Google Books has been digitizing the printed word, and it currently has 15 million works completed; by 2020, it audaciously aims to digitize every book ever written. While copyright issues loom, the project will allow researchers to access aggregate word and phrase data, as seen in the Google n-gram project.27 In Canada, Library and Archives Canada (LAC) has a more circumscribed digitization project, collecting "a representative sample of Canadian websites," focusing particularly on Government of Canada webpages, as well as a curated approach preserving case studies such as the 2006 federal election. Currently, it has 4TB of federal government information and 7TB of total information in its larger web archive (a figure which, as I note below, is unlikely to drastically improve, despite ostensible goals of archival 'modernization').28

In Canada, LAC has, on the surface, been advancing an ambitious digitization program, as it realizes the challenges now facing archivists and historians. Daniel J. Caron, the Librarian and Archivist of Canada, has been outspoken on this front, with several public addresses on the question of digital sources and archives. Throughout 2012, primary source digitization at LAC preoccupied critics, who saw it as a means to depreciate on-site access. The Canadian Association of University Teachers and the Canadian Historical Association, as well as other professional organizations, mounted campaigns against this shift in resources.29 LAC does have a point, however, with its modernization agenda: the state of Canada's online collections is small and sorely lacking when compared to its on-site collections, and LAC does need to modernize and achieve the laudable goal of reaching audiences beyond Ottawa.

Nonetheless, digitization has unfortunately been used as a smokescreen to depreciate overall service offerings.30 The modernization agenda would also see 50 percent of the digitization staff cut.31 LAC's online collections are small; they do not work with online developers through Application Programming Interfaces (APIs, a way for a computer program to talk to another computer and speed up research; Canadiana.ca is incidentally a leader in this area); and there have been no promising indications that this state of affairs will change. Digitization has not been comprehensive. Online preservation is important, and historians must fight for both on-site and on-line access.

A number of historians have recognized this challenge and are playing an instrumental role in preserving the born-digital past. The Roy Rosenzweig Center for History and New Media at George Mason University launched several high-profile preservation projects: the September 11th Digital Archive, the Hurricane Digital Memory Bank (focusing on Hurricanes Katrina and Rita), and, as of writing, the Occupy archive.32 Massive amounts of online content are curated and preserved: photographs, news reports, blog posts, and now tweets. These complement more traditional efforts of collecting and preserving oral histories and personal recollections, which are then geo-tagged (tied to a particular geographical location), transcribed, and placed online. These archives serve a dual purpose: preserving the past for future generations while also facilitating easy dissemination for today's audiences.

Digital preservation, however, is a topic of critical importance. Much of our early digital heritage has been lost. Many websites, especially those predating the Internet Archive's 1996 web archiving project launch, are completely gone. Early storage media (from tape to early floppy disks) have become obsolete and are now nearly inaccessible. Compounding this, software development has seen early proprietary file formats depreciated, ignored, and eventually made wholly inaccessible.33 Fortunately, the problem of digital preservation was recognized by the 1990s, and in 2000 the American government established the National Digital Information Infrastructure and Preservation Program.34 As early as 1998, digital preservationists argued: "[h]istorians will look back on this era … and see a period of very little information. A 'digital gap' will span from the beginning of the wide-spread use of the computer until the time we eventually solve this problem. What we're all trying to do is to shorten that gap." To bear home their successes, that quotation is taken from an Internet Archive snapshot of Wired magazine's website.35

Indeed, greater attention towards rigorous metadata (which is, briefly put, data about data, often preserved in plain-text format), open-access file formats documented by the International Organization for Standardization (ISO), and dedicated digital preservation librarians and archivists gives us hope for the future.36

It is my belief that cloud storage, the online hosting of data in large third-party offsite centres, mitigates the issue of obsolete storage media, and, furthermore, that the move towards open-source file formats, even amongst large commercial enterprises such as Microsoft, provides greater documentation and uniformity.

A Brief Case Study: Accessing and Navigating Canada's Digital Collections, with an Emphasis on Tools for Born-Digital Sources

What can we do about this digital deluge? There are no simple answers, but historians must begin to conceptualize new additions to their traditional research and pedagogical toolkits. In the section that follows, I will do two things: introduce a case study, and introduce several tools, both proprietary and those that are open and available off-the-shelf, to reveal one way forward.

For a case study, I have elected to use Canada's Digital Collections (CDC), a collection of 'dead' websites removed from their original location on the World Wide Web and now digitally archived at LAC. In this section, I will introduce the collection using digitally-archived sources from the early World Wide Web, and discuss my research methodology, before illustrating how we can derive meaningful data from massive arrays of unstructured data.37

This is a case study of a web archive. Several of these methodologies could be applied to other formats, including large arrays of textual material, whether born-digital or not: a distant reading of this type is not medium specific. It does, however, assume that the text is high quality. Digitized primary documents do not always have text layers, and if they are instituted computationally through Optical Character Recognition, errors can appear. Accuracy rates for digitized textual documents can range from as low as 40 percent for seventeenth- and eighteenth-century documents, to 80–90 percent for modern microfilmed newspapers, to over 99 percent for typeset, word-processed documents.38 That said, while historians today may be more interested in applying distant reading methodologies to more traditional primary source materials, it is my belief that in the not-so-distant future, websites will be an increasingly significant source for social, cultural, and political historians.

What are Canada's Digital Collections? In 1996, Industry Canada, funded by the federal Youth Employment Strategy, facilitated the development of Canadian heritage websites by youth (defined as anyone between the ages of 15 and 30). It issued contracts to private corporations that agreed to use these individuals as web developers, for the twofold purpose of developing digital skills and instilling respect for, and engagement with, Canadian heritage and preservation.39 The collection, initially called SchoolNet but rebranded as Canada's Digital Collections in 1999, was accessible from the CDC homepage. After 2004, the program was shut down and the vast majority of the websites were archived on LAC's webpage. This collection is now the largest born-digital Internet archive that LAC holds, worthy of study as content, as well as for thinking about methodological approaches.

Conventional records about the program are unavailable in the traditional LAC collection, unsurprising in light of both its recent creation and backlogs in processing and accessioning archival materials. We can use one element of our born-digital toolkit, the Internet Archive's WaybackMachine (its archive of preserved early webpages), to get a sense of what this project looked like in its heyday. The main webpage for the CDC was www.collections.ic.gc.ca. However, a visitor today will see nothing but a collection of broken image links (two of them will bring you to the English and French-language versions of the LAC-hosted web archive, although this is only discernible through trial and error, or through the site's source code).

With the WaybackMachine, we have a powerful, albeit limited, opportunity to "go back in time" and experience the website as previous web viewers would have (if a preserved website is available, of course). Among the over 420 million archived websites, we can search for and find the original CDC site. Between 12 October 1999 and the most recent snapshot, or web crawl, of the now-broken site on 6 July 2011, the site was crawled 410 times.40 Unfortunately, every site version until the 15 October 2000 crawl has been somehow corrupted: content has instead been replaced by Chinese characters, which translate only into place names and other random words. The issues of early digital preservation noted above have reared their head in this collection.

From late 2000 onward, however, fully preserved CDC sites are accessible. Featured sites, new entries, success stories, awards, curricular units, and prominent invitations to apply for project funding populate the brightly coloured website. The excitement is palpable. John Manley, then Industry Minister, declared that "it is crucial that Canada's young people get the skills and experience they need to participate in the knowledge-based economy of the future," and that CDC "shows that young people facing significant employment challenges can contribute to and learn from the resources available on the Information Highway."41 The documentation, reflecting the lower level of technical expertise and the scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as well as preservation issues, hardware requirements for computers (a computer with an Intel 386 processor, 8MB of RAM, a 1GB hard drive, and a 28.8 kbps modem was recommended), and training information for employers, educators, and participants.42 Sample proposals and community liaison ideas, amidst other information, set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites offers an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how viewers discovered the site and whether they think it is a good educational resource (the administrators must have been relieved that the majority thought so; in 2003, some 84 percent felt that it was a "good" resource, the best of its type).43 By 2004, the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, an alphabetical listing, success stories, and detailed copyright information.

The Canada's Digital Collections site began to move towards removal in November 2006, and initially the content was completely gone. A visitor to the site would learn that the content "is no longer available," and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007 a notice appeared declaring that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained then, and today for that matter, is a straightforward alphabetical list and a disclaimer that noted: "[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost."

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use to navigate large amounts of data. First, I create my own tools through computer programming; this work is showcased in the following section. I program in Wolfram Research's Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base that complements the exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website in perfectly readable plaintext (avoiding unicode errors), as follows: input = Import["http://activehistory.ca", "Plaintext"], and a second can convert it all to lower case for textual analysis, as such: lowerpage = ToLowerCase[input]. Variables are a critical building block: in the former example, "input" now holds the entire plaintext of the website ActiveHistory.ca, and "lowerpage" then holds it in lower case. Once we have lower-cased information, we can quickly extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, generate word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
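To make that workflow concrete, here is a minimal sketch of such a first pass, assuming a Mathematica 9 session; the URL is the same illustrative one used above, and the cutoff of 25 words is arbitrary:

    input = Import["http://activehistory.ca", "Plaintext"];
    lowerpage = ToLowerCase[input];
    (* split the plaintext into words, discarding punctuation *)
    words = StringSplit[lowerpage, Except[WordCharacter] ..];
    (* tally each word and sort by descending frequency *)
    wordFreq = SortBy[Tally[words], -Last[#] &];
    Take[wordFreq, 25]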

Even more promising is Mathematica's integration with the Wolfram|Alpha database (itself accessible at wolframalpha.com), a computational knowledge engine that promises to put incredible amounts of information at a user's fingertips. With a quick command, one can access the temperature at Toronto's Lester B. Pearson International Airport at 5:00 pm on 18 December 1983, the relative popularity of names such as "Edith" or "Emma," the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information, or to find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.
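A hedged sketch of such a query from within Mathematica (it requires a live internet connection, and the exact results the service returns can change over time):

    (* a full, formatted Wolfram|Alpha report inside the notebook *)
    WolframAlpha["Canadian unemployment rate March 1985"]
    (* or only the short answer, in a form suitable for further computation *)
    WolframAlpha["Canadian unemployment rate March 1985", "Result"]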

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be-all and end-all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools.47 It is a powerful suite of textual analysis tools. Uploading one's own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a "word cloud" of a corpus; the ability to click on any word to see its relative rise and fall over time (word trends); access to the original text; and the ability to click on any word to see its contextual placement within individual sentences. It ably facilitates the "distant" reading of any reasonably-sized body of textual information, and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis tool rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. Many visualization options are provided: you can plot relationships between data, compare values (i.e., bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as to IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or, in a best-case scenario, hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" that appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus (you could, for example, compare the lyrics that appear in 1968 against all lyrics in the postwar era, and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.
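MALLET can also be driven directly from the command line, outside of SEASR (see note 49). In the sketch below, the directory of plaintext files (cdc-plaintext/) and the choice of 20 topics are illustrative assumptions, not the collection's actual layout:

    bin/mallet import-dir --input cdc-plaintext/ --output cdc.mallet \
        --keep-sequence --remove-stopwords
    bin/mallet train-topics --input cdc.mallet --num-topics 20 \
        --output-topic-keys cdc-topic-keys.txt --output-doc-topics cdc-doc-topics.txt

The first command turns a folder of text files into MALLET's internal format; the second trains the model and writes both the keywords defining each topic and the topical breakdown of each document.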

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2, or on my own blog at ianmilligan.ca. I have elected to leave the step-by-step commands online rather than provide them in this article: they require some technical knowledge, and working directly from the web is quicker than working from the printed page. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian, dealing with conventional sources, often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command-line entry brought the entire archive onto my system; it took days to process the information, and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay, or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations: the information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, with 360 MB of plaintext with HTML encoding, reduced to a 'mere' 50MB if all encoding is removed. A 50MB plain-text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with the material. Before you do this, though, you need to be familiar with the website's "terms of service": the legal documentation that outlines how users are allowed to use a website, its resources, or its software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced in part or in whole and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so: the index is located at www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/AB/benevolat, the next in .../cdc/ab/nature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system; in this case, one that noted that you had seen the disclaimer about the archival nature of the website), and then, with a one-line command in my command line, to download the entire database.52 Due to the imposition of a random wait and a bandwidth limit, which meant that I would download one file every half to one-and-a-half seconds at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.
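The shape of that one-line command rewards a look. The sketch below mirrors the invocation reproduced in note 52, where cookie.txt is the previously saved cookie file recording that the archival disclaimer page has already been seen:

    wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt \
        --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm \
        --random-wait --wait=1 --limit-rate=30K \
        http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp

The -m ("mirror") flag enables recursion, while --random-wait, --wait, and --limit-rate impose the polite crawl pace described above.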

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these connections must often be inferred. Online, HTML born-digital sources are different in that the link offers a very concrete file.

Visualizing the CDC Collection as a Series of Hyperlinks

One quick visualization, the circular cluster seen above, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As the figure shows, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn which link it is, and thus see where the collection fits into the broader Internet. The crawl is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and the relationships between different sources. At a glimpse, we can see suggestions to follow other sites, and determine whether any nodes are shared amongst the various websites revealed.
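A minimal sketch of the underlying idea in Mathematica, crawling a single hop from one seed page rather than the two levels of the full model (the seed here is the collection index discussed above):

    (* pull every outbound hyperlink from a seed page *)
    seed = "http://www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp";
    links = DeleteDuplicates[Import[seed, "Hyperlinks"]];
    (* one directed edge per link; hovering over an edge reveals its name *)
    Graph[Map[(seed -> #) &, links], EdgeLabels -> Placed["Name", Tooltip]]

Repeating the Import over each discovered page, and pooling the resulting edges, yields the hairball-style cluster pictured in the figure.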

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian who is primarily interested in the written word, my first step was to isolate the plaintext throughout the material so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels, first the aggregate collection, and second the collection divided into each 'file,' or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes, and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words appear within the nearly eight million words of the collection? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436 times, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader, and it is of minimal visual attractiveness. Two quick decisions were made to enhance data usability. First, the information was "case-folded," all sent to lower case; this enables us to count the total appearances of a word such as "digital" regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency (as opposed to phrase frequency), "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is comprised entirely of stop words), but they will quickly dominate any word-level visualization or list.53 Other words may need to be added to, or removed from, stop lists depending on the corpus: for example, if the phrase "page number" appeared on every page, it would be useless information for the researcher.
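Both decisions can be sketched together in Mathematica, reusing the input variable from the earlier example; the stop list here is deliberately tiny and illustrative, where a real analysis would load a standard list of several hundred words:

    stopwords = {"the", "and", "to", "of", "a", "in", "is", "that"};
    (* case-fold, split into words, then drop the stop words *)
    words = StringSplit[ToLowerCase[input], Except[WordCharacter] ..];
    filtered = Select[words, ! MemberQ[stopwords, #] &];
    SortBy[Tally[filtered], -Last[#] &]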

The first step, then, is to use Mathematica to format word frequency information into a Wordle.net-readable format (if we put all of the text into that website, it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection.

A Word Cloud of All Words in the CDC Corpus

Here we can see both the advantages and disadvantages of static word clouds. What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that the collection deals with Canada and Canadians, that it holds documents in both French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent either). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, a word cloud leaves a severe amount of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or to educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but they need to be used with extreme caution. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present." Despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically use any higher number. A bigram can be a combination of two characters (such as 'bi'), two syllables, or, pertinent in our case, two words; an example is "canada is." A trigram is three words, a quadgram is four words, a fivegram is five words, and so on.
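Computationally, word-level n-grams need very little machinery. A sketch, reusing the filtered word list from the earlier example, shows that an overlapping Partition and a tally suffice:

    (* overlapping two- and three-word sequences *)
    bigrams = Partition[filtered, 2, 1];
    trigrams = Partition[filtered, 3, 1];
    (* the most frequent trigrams, sorted downward *)
    SortBy[Tally[trigrams], -Last[#] &]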


Word Cloud of the File "Past and Present"

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could refer to almost anything. Initially, I decided to study word frequency:

"alberta" 758, "listen" 490, "settlers" 467, "new" 454, "canada" 388, "land" 362, "read" 328, "people" 295, "settlement" 276, "chinese" 268, "canadian" 264, "place" 255, "heritage" 254, "edmonton" 250, "years" 238, "west" 236, "raymond" 218, "ukrainian" 216, "came" 213, "calgary" 209, "names" 206, "ranch" 204, "updated" 193, "history" 190, "communities" 183 …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, as in this automatically generated output of top phrases:

"first" "people" "settlers" 100, "the" "united" "states" 72, "one" "of" "the" 59, "of" "the" "west" 58, "settlers" "last" "updated" 56, "people" "settlers" "last" 56, "albertans" "first" "people" 56, "adventurous" "albertans" "first" 56, "the" "bar" "u" 55, "opening" "of" "the" 54, "communities" "adventurous" "albertans" 54, "new" "communities" "adventurous" 54, "bar" "u" "ranch" 53, "listen" "to" "the" 52 …

Through a few clicks, we can navigate and determine, from a distance, the basic contours of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stop-word" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

"experience" "in" "the" 919, "transferred" "to" "the" 918, "funded" "by" "the" 918, "of" "the" "provincial" 915, "this" "web" "site" 915, "409" "registry" "2" 914, "are" "no" "longer" 914, "gouvernement" "du" "canada" 910, "use" "this" "link" 887, "page" "or" "you" 887, "following" "page" "or" 887, "automatically" "transferred" "to" 887, "be" "automatically" "transferred" 887, "will" "be" "automatically" 887, "industry" "you" "will" 887, "multimedia" "industry" "you" 887, "practical" "experience" "in" 887, "gained" "practical" "experience" 887, "they" "gained" "practical" 887, "while" "they" "gained" 887, "canadians" "while" "they" 887, "young" "canadians" "while" 887, "showcase" "of" "work" 887, "excellent" "showcase" "of" 887 …

Here we see an array of common language: funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we move into the unique or very rare phrases that appear only once, twice, or three times.

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a computer science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time. … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable, and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.

At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text down to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works: it is a proprietary black box. While the original algorithm is available, the actual methods that go into determining the ranking of search terms are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above, we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
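As a sketch of that search step in Mathematica: the structure siteCounts, a list of {site, word, count} triples assembled from the per-site frequency files, is my assumption about data layout rather than a fixed format:

    searchTerms = {"freedom", "black", "liberty"};
    (* keep every {site, word, count} triple whose word is a search term *)
    hits = Cases[siteCounts, {_, w_, _} /; MemberQ[searchTerms, w]];
    (* one bar per hit; hovering reveals the site and the term *)
    BarChart[Map[Tooltip[#[[3]], {#[[1]], #[[2]]}] &, hits]]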

We can then move down into the data. In the example above, file "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.


A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are located. If we move forward, we can see that it pertains to the nineteenth century, to a particular region, and to subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into the individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly identify critical sources.
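Keyword-in-context itself reduces to a few string operations. A minimal sketch, assuming the plaintext of one site sits in the variable text, with a fixed 40-character window on either side of each match:

    kwic[text_String, term_String, window_Integer : 40] :=
      Map[StringTake[text, {Max[1, #[[1]] - window],
            Min[StringLength[text], #[[2]] + window]}] &,
        StringPosition[ToLowerCase[text], ToLowerCase[term]]]

    kwic[text, "ontario"]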

Another model through which we can visualize documents is a viewer primarily focusing on term frequency and inverse document frequency, terms discussed below. To prepare this material, we combine our earlier data about word frequency within one site (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e., that the word "experience" appears 1,766 times overall). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, through which we can learn which words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf times the logarithm of the number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."
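Written out, with N the number of sites in the collection and df(t) the number of sites containing term t, the weight of term t in document d is tf-idf(t, d) = tf(t, d) × log(N / df(t)). As a hedged worked example, with the document frequency invented purely for illustration: if "experience" appears 919 times in one site and in 300 of the collection's 384 sites, its weight is 919 × log(384/300), roughly 98 with a base-10 logarithm. A term appearing in every site would score zero, which is precisely what pushes boilerplate language down the list.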


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet still distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5MB, and Voyant has no hard-and-fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance rather than processed on demand, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need arises, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential of working with born-digital sources. It is intended to begin, or jumpstart, a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. They should not be mandatory, but should instead be taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and of attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software and databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of the various ways to openly disseminate material on the web, whether through blogging, micro-blogging, publishing webpages, and so forth; it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than on reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history, and the readiness to work with born-digital sources, cannot stem from pedagogy alone, however. Institutional support is necessary. While this is based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for instance, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article has argued, first, why digital methodologies are necessary, and second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996: Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and the structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.


14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients - 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to teach the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook (Penguin, 2010). For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online: www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can, of course, vary depending on the compression methods chosen. Their math was based on an average of 8MB per book, if scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82; published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC: May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Centre for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Centre for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians, but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so that one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in ProQuest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online: www.wayback.archive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online: www.web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collection Webpage, 24 November 2000, through Internet Archive, www.web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections homepage," 21 November 2003, through Internet Archive, www.web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important news about Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive, www.web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 pm on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and "Emma" is now the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.


48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 "Important Notices," Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. They can usually be found on index or splash pages, under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated, late 2007 MacBook Pro.

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>; and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.



prepared to work on larger, truly interdisciplinary teams. Computer scientists, statisticians, and mathematicians, for example, have critical methodological and analytical contributions to make towards questions of historical inquiry. Historians need to forge further links with these disciplines. Such meaningful collaboration, however, will require historians who can speak their languages and who have an understanding of their work. Some of us must be able to discuss computational or algorithmic methodologies when we collectively explore the past.

The methodologies discussed in this paper fall into the twin fields of data mining and textual analysis. The former refers to the discovery of patterns or information about large amounts of information, while the latter refers to the use of digital tools to analyze large quantities of text.9 The case study here draws upon three main concepts: information retrieval (quickly finding what you want), information and term frequency (how often do various concepts and terms appear?), and topic modeling (finding frequently occurring bags of words throughout documents). Other projects have involved authorship attribution, which is finding out the author of documents based on previous patterns, style analysis, or creating network graphs of topics, groups, documents, or individuals.

The Culturomics project, run by the Cultural Observatory at Harvard University, is a notable model of collaboration on a big history project. As laid out in Science and put online as the Google Books n-gram viewer (www.books.google.com/ngrams), this project indexed word and phrase frequency across over five million books, enabling researchers to trace the rise and fall of cultural ideas and phenomena through targeted keyword and phrase searches.10 This is the most ambitious, and certainly the most advanced and accessible, Big History project yet launched. Yet amongst the extensive author list (13 individuals and the Google Books team) there were no historians. This suggests that other disciplines in the humanities and social sciences have embraced the digital turn far more than historians. However, as I mentioned, in the coming decades historians will increasingly rely on digital archives. There is a risk that history as a professional discipline could be left behind.

American Historical Association President Anthony Grafton noted this absence in a widely circulated Perspectives article. Why was there no historian, he asked, when this was both primarily a historical project and a team comprised of varied doctoral holders from disciplines such as English literature, psychology, computer science, biology, and mathematics?12 Culturomics project leaders Erez Lieberman Aiden and Jean-Baptiste Michel responded frankly. While the team approached historians, and some did play advisory roles, every author listed "directly contributed to either the creation of the collection of written texts (the 'corpus') or to the design and execution of the specific analyses we performed. No academic historians met this bar." They continued to indict the profession:

The historians who came to the meeting were intelligent, kind, and encouraging. But they didn't seem to have a good sense of how to wield quantitative data to answer questions, didn't have relevant computational skills, and didn't seem to have the time to dedicate to a big multi-author collaboration. It's not their fault: these things don't appear to be taught or encouraged in history departments right now.13

This is a serious indictment considering the pending archival shift to digital sources, and one that still largely holds true (a few history departments now offer digital history courses, but they are still outliers). This form of history, the massive analysis of digitized information, will continue. Furthermore, these approaches will attract public attention, as they lend themselves well to accessible online deployment. Unless historians are well positioned, they risk ceding professional ground to other disciplines.

While historians were not involved in Culturomics, they are, however, involved in several other digital projects. Two major and ambitious long-term Canadian projects are currently using computational methods to reconstruct the demographic past. The first is the Canadian Century Research Infrastructure project, which is creating databases from five censuses and aiming to provide "a new foundation for the study of social, economic, cultural, and political change," while the second is the Programme de recherche en démographie historique, hosted by the Université de Montréal.13 They are reconstructing the European population of Quebec in the seventeenth and eighteenth centuries, drawing heavily on parish registers. Furthermore, several projects in England have harnessed the wave of interest in "big data" to tell compelling historical stories about their pasts: the Criminal Intent project, for example, used the proceedings of the Old Bailey courts (the main criminal court in Greater London) to visualize large quantities of historical information.14 The Criminal Intent project was itself funded by the Digging into Data Challenge, which brings together multinational research teams and funding from federal-level funding agencies in Canada, the United States, the United Kingdom, and beyond. Several other historical studies have been funded under their auspices, from studies of commodity trade in the eighteenth- and nineteenth-century Atlantic world (involving scholars at the University of Edinburgh, the University of St Andrews, and York University) to data mining the degrees of economic opportunity and spatial mobility in Britain, Canada, and the United States (bringing researchers at the University of Guelph, the University of Leicester, and the University of Minnesota together).15 These exciting interdisciplinary teams may be heralding a disciplinary shift, although it is still early days.

Collaborative teams have limitations, though. We will have to be prepared to tackle digital sources alone, on shoestring budgets. Historians will not always be able to call on external programmers or specialists to do work, because teams often develop out of funding grants. As governmental and university budgets often prove unable to meet consistent research needs, baseline technical knowledge is required for independent and unfunded work. If historians rely exclusively on outside help to deal with born-digital sources, the ability to do computational history diminishes. As the success of the Programming Historian indicates, and as I will demonstrate in my case study below, historians can learn basic techniques.16 Let us tweak our own algorithms.

A social or cultural history of the very late-twentieth or twenty-first century will have to account for born-digital sources, and historians will (in conjunction with archival professionals) have to curate and make sense of the data. Historians need to be able to engage with these sources themselves, or at least as part of a team, rather than having sources mediated through the lens of outside disciplines or commercial interests. Algorithms may be one meaningful way to make sense of cultural trends or extract meaningful information from billions of tweets, but we must ensure professional historical input goes into their crafting.

The Third Wave of Computational History

Historians have fruitfully used computers in the past, as part of two previous waves. The 1960s and 1970s saw computational tools used for demographic, population, and economic histories, a trend that in Canada saw full realization in comprehensive studies such as Michael Katz's The People of Hamilton, Canada West.17 As Ian Anderson has convincingly noted in a review of history and computing, studies of this sort saw the field become associated with quantitative studies.18

After a retreat from the spotlight, computational history rose again in the 1990s with the advent of the personal computer, graphical user interfaces, and overall improvements in ease of use. Punch cards could now give way to database programs, GIS, and early online networks such as H-Net and Usenet. We are now, I believe, on the cusp of a third revolution in computational history, thanks to three main factors: decreasing storage costs, the power of the Internet and distributed cloud computing, and the rise of professionals dealing with both digital preservation and open-source tools.

Decreasing storage costs have led to massive amounts of information being preserved, as Gleick noted in this article's epigraph. It is important to note that this information overload is not new. People have long worried about the impact of too much information.19 In the sixteenth century, responding to the rise of the printing press, the German priest Martin Luther decried that the "multitude of books [were] a great evil"; in the nineteenth century, Edgar Allan Poe bemoaned that "[t]he enormous multiplication of books in every branch of knowledge is one of the greatest evils of this age"; and, as recently as 1970, American historian Lewis Mumford lamented that "the overproduction of books will bring about a state of intellectual enervation and depletion hardly to be distinguished from massive ignorance."20 The rise of born-digital sources must thus be seen in this continuous context of handwringing around the expansion and rise of information.

Useful historical information is being preserved at a rate that is accelerating with every passing day. One useful metric to compare dataset size is that of the American Library of Congress (LOC); indeed, a "LOC" has become shorthand for a unit of measurement in the digital humanities. Measuring the actual extent of the printed collection in terms of data is notoriously difficult: James Gleick has claimed it to be around 10TB,22 whereas the Library of Congress itself claims a more accurate figure is about 200TB.23 While these two figures are obviously divergent, neither is on an unimaginable scale; personal computers now often ship with a terabyte or more of storage. When Claude Shannon, the father of information theory, carried out a thought experiment in 1949 about items or places that might store "information" (an emerging field which viewed messages in their aggregate), he scaled sites from a digit upwards to the largest collection he could imagine. The list scaled up logarithmically, from 10^0 to 10^13: if a single-spaced page of typing is 10^4, 10^7 the Proceedings of the Institute of Radio Engineers, and 10^9 the entire text of the Encyclopaedia Britannica, then 10^14 was "the largest information stockpile he could think of: the Library of Congress."24

But now the Library of Congress' print collection, something that has taken two centuries to gather, is dwarfed by their newest collection of archived born-digital sources, the vast majority only a matter of years old compared to the much wider date-range of non-digital sources in the traditional collection. The LOC has begun collecting its own born-digital Internet archive: multiple snapshots are taken of webpages to show how they change over time. Even as a selective, curated collection drawing on governmental, political, educational, and creative websites, the LOC has already collected 254TB of data, and adds 5TB a month.25 The Internet Archive, through its Wayback Machine, is even more ambitious: it seeks to archive every website on the Internet. While its size is also hard to put an exact finger on, as of late 2012 it had over 10 petabytes of information (or, to put it into perspective, a little over 50 Library of Congresses if we take the 200TB figure).26


This is to some degree comparing apples and oranges: the Library of Congress is predominantly print, whereas the Internet Archive has considerable multimedia holdings, which include videos, images, and audio files. The LOC is furthermore a more curated collection, whereas the Internet Archive draws from a wider range of producers. The Internet offers the advantages and disadvantages of being a more democratic archive. For example, a GeoCities site (an early way for anybody to create their own website) created by a 12-year-old Canadian in the mid-1990s might be preserved by the Internet Archive, whereas it almost certainly would not be saved by a national archive. This difference in size and collections management, however, is at the root of the changing historians' toolkit. If we make use of these large data troves, we can access a new and broad range of historical subjects.

Conventional sources are also part of this deluge. Google Books has been digitizing the printed word, and it currently has 15 million works completed; by 2020, it audaciously aims to digitize every book ever written. While copyright issues loom, the project will allow researchers to access aggregate word and phrase data, as seen in the Google n-gram project.27 In Canada, Library and Archives Canada (LAC) has a more circumscribed digitization project, collecting "a representative sample of Canadian websites," focusing particularly on Government of Canada webpages, as well as a curated approach preserving case studies such as the 2006 federal election. Currently it has 4TB of federal government information and 7TB of total information in its larger web archive (a figure which, as I note below, is unlikely to drastically improve despite ostensible goals of archival 'modernization').28

In Canada, LAC has, on the surface, been advancing an ambitious digitization program, as it realizes the challenges now facing archivists and historians. Daniel J. Caron, the Librarian and Archivist of Canada, has been outspoken on this front, with several public addresses on the question of digital sources and archives. Throughout 2012, primary source digitization at LAC preoccupied critics, who saw it as a means to depreciate on-site access. The Canadian Association of University Teachers, the Canadian Historical Association, as well as other professional organizations, mounted campaigns against this shift in resources.29 LAC does have a point, however, with their modernization agenda: Canada's online collections are small and sorely lacking when compared to their on-site collections, and LAC does need to modernize and achieve the laudable goal of reaching audiences beyond Ottawa.

Nonetheless, digitization has unfortunately been used as a smokescreen to depreciate overall service offerings.30 The modernization agenda would also see 50 percent of the digitization staff cut.31 LAC's online collections are small; they do not work with online developers through Application Programming Interfaces (APIs, a way for a computer program to talk to another computer and speed up research; Canadiana.ca is incidentally a leader in this area); and there have been no promising indications that this state of affairs will change. Digitization has not been comprehensive. Online preservation is important, and historians must fight for both on-site and on-line access.

A number of historians have recognized this challenge and are playing an instrumental role in preserving the born-digital past. The Roy Rosenzweig Center for History and New Media at George Mason University launched several high-profile preservation projects: the September 11th Digital Archive, the Hurricane Digital Memory Bank (focusing on Hurricanes Katrina and Rita), and, as of writing, the Occupy archive.32 Massive amounts of online content are curated and preserved: photographs, news reports, blog posts, and now tweets. These complement more traditional efforts of collecting and preserving oral histories and personal recollections, which are then geo-tagged (tied to a particular geographical location), transcribed, and placed online. These archives serve a dual purpose: preserving the past for future generations, while also facilitating easy dissemination for today's audiences.

Digital preservation, however, is a topic of critical importance. Much of our early digital heritage has been lost. Many websites, especially those predating the Internet Archive's 1996 web archiving project launch, are completely gone. Early storage mediums (from tape to early floppy disks) have become obsolete and are now nearly inaccessible. Compounding this, software development has seen early proprietary file formats depreciated, ignored, and eventually made wholly inaccessible.33 Fortunately, the problem of digital preservation was recognized by the 1990s, and in 2000 the American government established the National Digital Information Infrastructure and Preservation Program.34 As early as 1998, digital preservationists argued: "[h]istorians will look back on this era … and see a period of very little information. A 'digital gap' will span from the beginning of the wide-spread use of the computer until the time we eventually solve this problem. What we're all trying to do is to shorten that gap." To bear home their successes, that quotation is taken from an Internet Archive snapshot of Wired magazine's website.35

Indeed, greater attention towards rigorous metadata (which is, briefly put, data about data, often preserved in plain text format), open-access file formats documented by the International Organization for Standardization (ISO), and dedicated digital preservation librarians and archivists gives us hope for the future.36 It is my belief that cloud storage, the online hosting of data in large third-party offsite centres, mitigates the issue of obsolete storage mediums, and, furthermore, that the move towards open-source file formats, even amongst large commercial enterprises such as Microsoft, provides greater documentation and uniformity.


A Brief Case Study: Accessing and Navigating Canada's Digital Collections, with an Emphasis on Tools for Born-Digital Sources

What can we do about this digital deluge? There are no simple answers, but historians must begin to conceptualize new additions to their traditional research and pedagogical toolkits. In the section that follows, I will do two things: introduce a case study, and several tools, both proprietary and those that are open and available off-the-shelf, to reveal one way forward.

For a case study, I have elected to use Canada's Digital Collections (CDC), a collection of 'dead' websites removed from their original location on the World Wide Web and now digitally archived at LAC. In this section, I will introduce the collection using digitally-archived sources from the early World Wide Web and discuss my research methodology, before illustrating how we can derive meaningful data from massive arrays of unstructured data.37

This is a case study of a web archive. Several of these methodologies could be applied to other formats, including large arrays of textual material, whether born-digital or not. A distant reading of this type is not medium specific. It does, however, assume that the text is high quality. Digitized primary documents do not always have text layers, and if they are instituted computationally through Optical Character Recognition, errors can appear. Accuracy rates for digitized textual documents can range from as low as 40 percent for seventeenth- and eighteenth-century documents, to 80–90 percent for modern microfilmed newspapers, to over 99 percent for typeset, word-processed documents.38 That said, while historians today may be more interested in applying distant reading methodologies to more traditional primary source materials, it is my belief that in the not-so-distant future websites will be an increasingly significant source for social, cultural, and political historians.

What are Canada's Digital Collections? In 1996, Industry Canada, funded by the federal Youth Employment Strategy, facilitated the development of Canadian heritage websites by youth (defined as anyone between the ages of 15 and 30). It issued contracts to private corporations that agreed to use these individuals as web developers, for the twofold purpose of developing digital skills and instilling respect for and engagement with Canadian heritage and preservation.39 The collection, initially called SchoolNet but rebranded as Canada's Digital Collections in 1999, was accessible from the CDC homepage. After 2004, the program was shut down, and the vast majority of the websites were archived on LAC's webpage. This collection is now the largest born-digital Internet archive that LAC holds, worthy of study as content as well as for thinking about methodological approaches.

Conventional records about the program are unavailable in the traditional LAC collection, unsurprising in light of both its recent creation and backlogs in processing and accessioning archival materials. We can use one element of our born-digital toolkit, the Internet Archive's WaybackMachine (its archive of preserved early webpages), to get a sense of what this project looked like in its heyday. The main webpage for the CDC was www.collections.ic.gc.ca. However, a visitor today will see nothing but a collection of broken image links (two of them will bring you to the English and French-language versions of the LAC-hosted web archive, although this is only discernible through trial and error or through the site's source code).

With the WaybackMachine, we have a powerful, albeit limited, opportunity to "go back in time" and experience the website as previous web viewers would have (if a preserved website is available, of course). Among the over 420 million archived websites, we can search for and find the original CDC site. Between 12 October 1999 and the most recent snapshot, or web crawl, of the now-broken site on 6 July 2011, the site was crawled 410 times.40 Unfortunately, every site version until the 15 October 2000 crawl has been somehow corrupted: content has instead been replaced by Chinese characters, which translate only into place names and other random words. Issues of early digital preservation, noted above, have reared their head in this collection.

From late 2000 onward, however, fully preserved CDC sites are accessible. Featured sites, new entries, success stories, awards, curricular units, and prominent invitations to apply for project funding populate the brightly coloured website. The excitement is palpable. John Manley, then Industry Minister, declared that "it is crucial that Canada's young people get the skills and experience they need to participate in the knowledge-based economy of the future," and that CDC "shows that young people facing significant employment challenges can contribute to and learn from the resources available on the Information Highway."41 The documentation, reflecting the lower level of technical expertise and scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as well as preservation issues, hardware requirements for computers (a computer with an Intel 386 processor, 8MB RAM, a 1GB hard drive, and 28.8kbps modem was recommended), and training information for employers, educators, and participants.42 Sample proposals and community liaison ideas, amidst other information, set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites offers an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how viewers discovered the site and whether they think it is a good educational resource (the creators must have been relieved that the majority thought so; in 2003, some 84 percent felt that it was a "good" resource, the best of its type).43 By 2004, the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, alphabetical listing, success stories, and detailed copyright information.

The Canada's Digital Collections site began to move towards removal in November 2006, when it initially disappeared completely. A visitor to the site would learn that the content "is no longer available," and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007, a notice appeared with a declaration that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained then, and today for that matter, is a straightforward alphabetical list and a disclaimer that noted: "[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost."

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use to navigate large amounts of data. First, I create my own tools through computer programming; this work is showcased in the following section. I program in Wolfram Research's Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base, which complements exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website in perfectly readable plaintext (avoiding unicode errors), as follows: input = Import["http://activehistory.ca", "Plaintext"], and a second can make it all lower case for textual analysis, as such: lowerpage = ToLowerCase[input]. Variables are a critical building block: in the former example, "input" now holds the entire plaintext of the website ActiveHistory.ca, and "lowerpage" then holds it in lower case. Once we have lower-cased information, we are quickly able to extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
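To give a sense of what those next steps might look like in practice, here is a minimal sketch in Mathematica of the kind of word-frequency routine described above (the URL is the same example used in the text; the variable names are my own illustration, not the code from the original study):

  (* import a page as plaintext and case-fold it *)
  input = Import["http://activehistory.ca", "Plaintext"];
  lowerpage = ToLowerCase[input];
  (* split into words on runs of non-word characters *)
  words = StringSplit[lowerpage, Except[WordCharacter] ..];
  (* tally each word and sort by descending frequency *)
  wordFreq = SortBy[Tally[words], -Last[#] &];
  Take[wordFreq, 25]

The final line returns the 25 most frequent words with their counts: the raw material for the tables and word clouds discussed below.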

Even more promising is Mathematica's integration with the Wolfram|Alpha database (itself accessible at wolframalpha.com), a computational knowledge engine that promises to put incredible amounts of information at a user's fingertips. With a quick command, one can access the temperature at Toronto's Lester B. Pearson International Airport at 5:00 pm on 18 December 1983, the relative popularity of names such as "Edith" or "Emma," the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information, or find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.
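Inside a Mathematica notebook, such queries take the form of one-line calls to the built-in WolframAlpha function; a sketch (the exact query strings are illustrative, and the results depend on what the Wolfram|Alpha service returns at run time):

  (* free-form queries against the Wolfram|Alpha knowledge engine *)
  WolframAlpha["unemployment rate Canada March 1985"]
  WolframAlpha["Canada GDP December 1995"]

Because the results come back as structured data rather than a web page, they can be fed directly into further computations or visualizations.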

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be all and end all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools.47 It is a powerful suite of textual analysis tools. Uploading one's own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a "word cloud" of a corpus; click on any word to see its relative rise and fall over time (word trends); access the original text; and click on any word to see its contextual placement within individual sentences. It ably facilitates the "distance" reading of any reasonably-sized body of textual information, and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. There are many visualization options provided: you can plot relationships between data, compare values (i.e., bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or, in a best case scenario, hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" which appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus: for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era, and learn what makes 1968 most unique.50 There are also sentiment analyses, which attempt to plot changes in emotional words and phrases over time, and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.
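A brief aside on the mathematics, since the statistic is easy to state. In its common log-likelihood form (my gloss of the standard formulation, not a formula given in the original article), a word's distinctiveness score is

  G² = 2 × Σᵢ Oᵢ ln(Oᵢ / Eᵢ)

where O₁ and O₂ are the word's observed counts in the study corpus and the reference corpus, and E₁ and E₂ are the counts expected if the word were spread across both in proportion to their overall sizes. The higher the score, the more strongly a word distinguishes the study corpus (1968's lyrics, in the example above) from the reference.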

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2, or on my own blog at ianmilligan.ca. I have elected to leave them online rather than provide commands in this article: they require some technical knowledge, and working directly from the web is quicker. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian, dealing with conventional sources, often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command line entry brought the entire archive onto my system; it took days to process the information, and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay, or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations. The information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, with 360 MB of plaintext with HTML encoding, reduced to a 'mere' 50MB if all encoding is removed. A 50MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the Internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with material. Before you do this, though, you need to be familiar with the website's "terms of service," or the legal documentation that outlines how users are allowed to use a website, resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced in part or in whole and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so: the index is located at www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/ABbenevolat, the next in …cdc/abnature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system, in this case one that noted that you had seen the disclaimer about the archival nature of the website), and then, with a one-line command in my command line, to download the entire database.52 Due to the imposition of a random wait and bandwidth limit, which meant that I would download one file every half to one and a half seconds, at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these must often be inferred online. HTML born-digital sources are different, in that the link offers a very concrete file.

[Figure: Visualizing the CDC Collection as a Series of Hyperlinks]

One quick visualization, the circular cluster seen in the figure, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As the figure shows, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The site is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and relationships between different sources. At a glimpse, we can see suggestions to follow other sites, and determine if any nodes are shared amongst the various websites revealed.
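A rough outline of how such a link graph can be assembled in Mathematica, using the built-in "Hyperlinks" import element (the seed URL and the one-hop depth handling here are my own simplified illustration, not the exact code behind the figure):

  (* fetch all outbound links on a page, returning {} on failure *)
  scrapeLinks[url_] := Quiet[Check[Import[url, "Hyperlinks"], {}]];
  seed = "http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp";
  firstHop = scrapeLinks[seed];
  (* one directed edge per hyperlink, then render as a graph *)
  edges = Join[Thread[seed -> firstHop],
    Flatten[Map[Function[u, Thread[u -> scrapeLinks[u]]], firstHop]]];
  Graph[edges, VertexLabels -> Placed["Name", Tooltip]]

Mousing over a vertex then reveals the underlying URL, which is what makes the dynamic version of the figure navigable.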

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian who is primarily interested in the written word, my first step was to isolate the plaintext throughout the material, so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels: first, the aggregate collection, and second, dividing the collection into each 'file,' or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes, and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words appear within the nearly eight million words of the collection? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader, and it is of minimal visual attractiveness. Two quick decisions were made to enhance data usability. First, the information was "case-folded," all sent to lower-case; this enables us to count the total appearances of a word such as "digital," regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency (as opposed to phrase frequency), "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is comprised entirely of stop words), but will quickly dominate any word-level visualization or list.53 Other words may need to be added to or removed from stop lists depending on the corpus: for example, if the phrase "page number" appeared on every page, that would be useless information for the researcher.
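Filtering a tally against a stop list is a one-liner in Mathematica; a minimal sketch, using a deliberately short illustrative stop list (a real one would run to a hundred words or more) and the wordFreq tallies from the earlier sketch:

  (* drop extremely common words before counting or visualizing *)
  stopwords = {"the", "and", "to", "of", "a", "in", "is", "for"};
  filteredFreq = Select[wordFreq, ! MemberQ[stopwords, First[#]] &];

Because the stop list is just a plain list of strings, corpus-specific noise such as "page" or "number" can simply be appended to it as it is discovered.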

The first step, then, is to use Mathematica to format word frequency information into Wordle.net readable format (if we put all of the text into that website, it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds.

[Figure: A Word Cloud of All Words in the CDC Corpus]

What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that it is dealing with Canada and Canadians; that it is a collection with documents in French and English; that we are dealing with digitized collections; and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than both "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.
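The conversion step mentioned above is straightforward; a sketch of one way to do it, assuming the filteredFreq tallies from earlier (Wordle's advanced input accepts one "word:weight" pair per line):

  (* emit the top words as word:count lines for Wordle's advanced input *)
  wordleLines = Map[First[#] <> ":" <> ToString[Last[#]] &, Take[filteredFreq, 150]];
  Export["cdc-wordle.txt", StringJoin[Riffle[wordleLines, "\n"]], "Text"];

The resulting small text file, rather than the 50MB corpus itself, is what gets pasted into the word cloud generator.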

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, a word cloud leaves a severe amount of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but need to be used with extreme caution. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present": despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters (such as 'bi'), two syllables, or, pertinent in our case, two words; an example: "canada is." A trigram is three words, a quadgram is four words, a fivegram is five words, and so on.
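Extracting n-grams from a token list is a natural fit for Mathematica's Partition function; a minimal sketch, assuming the words list built earlier (the offset of 1 produces overlapping n-grams, which is what phrase counting needs):

  (* overlapping word n-grams, tallied and sorted by frequency *)
  ngrams[tokens_, n_] := SortBy[Tally[Partition[tokens, n, 1]], -Last[#] &];
  bigrams = ngrams[words, 2];
  trigrams = ngrams[words, 3];

Each entry pairs a phrase, stored as a list of words, with its count, which is essentially the structure behind the phrase tables shown below.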


[Figure: Word Cloud of the File "Past and Present"]

[Figure: A viewer for n-grams in the CDC Collection]

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

"alberta" 758, "listen" 490, "settlers" 467, "new" 454, "canada" 388, "land" 362, "read" 328, "people" 295, "settlement" 276, "chinese" 268, "canadian" 264, "place" 255, "heritage" 254, "edmonton" 250, "years" 238, "west" 236, "raymond" 218, "ukrainian" 216, "came" 213, "calgary" 209, "names" 206, "ranch" 204, "updated" 193, "history" 190, "communities" 183 …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example in this automatically generated output of top phrases:

"first people settlers" 100, "the united states" 72, "one of the" 59, "of the west" 58, "settlers last updated" 56, "people settlers last" 56, "albertans first people" 56, "adventurous albertans first" 56, "the bar u" 55, "opening of the" 54, "communities adventurous albertans" 54, "new communities adventurous" 54, "bar u ranch" 53, "listen to the" 52 …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457), or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stop-word" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

"experience in the" 919, "transferred to the" 918, "funded by the" 918, "of the provincial" 915, "this web site" 915, "409 registry 2" 914, "are no longer" 914, "gouvernement du canada" 910, "use this link" 887, "page or you" 887, "following page or" 887, "automatically transferred to" 887, "be automatically transferred" 887, "will be automatically" 887, "industry you will" 887, "multimedia industry you" 887, "practical experience in" 887, "gained practical experience" 887, "they gained practical" 887, "while they gained" 887, "canadians while they" 887, "young canadians while" 887, "showcase of work" 887, "excellent showcase of" 887 …

Here we see an array of common language: funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we can move into the unique or very rare phrases that appear only once, twice, or three times.
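Computing the alternative, site-count measure mentioned above (the n=45/384 style of figure) is straightforward once each site's plaintext lives in its own file; a minimal sketch in Mathematica, assuming a hypothetical cdc-plaintext directory holding one text file per website:

  (* in how many of the sites does a term appear at least once? *)
  files = FileNames["*.txt", "cdc-plaintext"];
  docFreq[term_] := Count[files, f_ /; ! StringFreeQ[ToLowerCase[Import[f, "Text"]], term]];
  docFreq["aboriginal"] / Length[files]  (* relative appearances, e.g. 45/384 *)

Either measure can feed the same charts; the choice simply determines whether one repetitive site can dominate the result. The files list defined here is also reused in the search sketch further below.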

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable, and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.
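For those working outside the Meandre workbench, the command-line route through MALLET involves two documented commands, an import step and a training step; a sketch, with file and directory names that are my own illustration rather than those used in this study:

  bin/mallet import-dir --input past-and-present/ --output cdc-267.mallet --keep-sequence --remove-stopwords
  bin/mallet train-topics --input cdc-267.mallet --num-topics 20 --output-topic-keys cdc-267-keys.txt --output-doc-topics cdc-267-topics.txt

The topic-keys file lists the most probable words for each of the twenty topics, the textual equivalent of the word cloud discussed next.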


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes); the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language); the stories (story, knight, archives, but also, interestingly, knight and wife); and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works: it is a proprietary black box. While the original algorithm is available, the actual methods that go into determining the ranking of search terms are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above, we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
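A stripped-down version of that search-and-chart step in Mathematica might look as follows (the Tooltip-driven mouseover is my own sketch of the approach, not the exact interface pictured, and it reuses the files list from the document-frequency sketch above):

  (* count a term's occurrences in each site and chart the results *)
  termCounts[term_] := Map[StringCount[ToLowerCase[Import[#, "Text"]], term] &, files];
  BarChart[Map[Tooltip[Last[#], First[#]] &, Transpose[{files, termCounts["freedom"]}]]]

Each bar is wrapped in a Tooltip carrying its file name, so hovering reveals which site produced a given spike.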

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.


[Figure: A snapshot of the dynamic finding aid tool]

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century and a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly identify critical sources.

Another model through which we can visualize documents is a viewer primarily focusing on term document frequency and inverse frequency terms, discussed below. To prepare this material, we are combining our earlier data about word frequency (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e., the word "experience" appears 1,766 times). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf times the logarithm of the number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."
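Written out, with n the number of sites in the collection and df(t) the number of sites containing term t (my notation, following the standard information retrieval formulation the article cites):

  tf-idf(t) = tf(t) × log( n / df(t) )

As a one-line Mathematica function, using the CDC's 384 sites as the illustrative corpus size and made-up counts for the rest:

  tfidf[tf_, df_, n_] := tf*Log[n/df];
  tfidf[919, 45, 384] // N  (* a term appearing 919 times, in 45 of 384 sites *)

A word that appears in nearly every site has df close to n, so its logarithm, and hence its score, collapses towards zero; rare but locally frequent words score highest.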


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance, rather than processed on the spot, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need arises, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin or jumpstart a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. These should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and of attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before. How have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information; this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web: whether through blogging, micro-blogging, publishing webpages, and so forth, it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history and the readiness to work with born-digital sources cannot stem out of pedagogy alone, however. Institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary, and second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.


14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients – 2011," Digging into Data webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to learn the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp, Greater Toronto Area, Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook (Penguin, 2010). For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online: www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can of course vary depending on the compression methods chosen. Their math was based on an average of 8MB per book, if scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82, published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC: May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Centre for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Centre for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online: www.wayback.archive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," news release from Industry Canada, Internet Archive, 15 November 1996, available online: www.web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collections webpage, 24 November 2000, through Internet Archive: www.web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections homepage," 21 November 2003, through Internet Archive: www.web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important news about Canada's Digital Collections," Canada's Digital Collections webpage, 18 November 2006, through Internet Archive: www.web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 pm on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and "Emma" is now the third most popular name in the United States for new births (in 2010); the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.


48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monk/middleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 "Important Notices," Library and Archives Canada website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. These can usually be found on index or splash pages, under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8, available online at www.nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated, late-2007 MacBook Pro.

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.



there no historian, he asked, when this was both primarily a historical project and a team comprised of varied doctoral holders from disciplines such as English literature, psychology, computer science, biology, and mathematics?12 Culturomics project leaders Erez Lieberman Aiden and Jean-Baptiste Michel responded frankly. While the team approached historians and some did play advisory roles, every author listed "directly contributed to either the creation of the collection of written texts (the 'corpus') or to the design and execution of the specific analyses we performed. No academic historians met this bar." They continued to indict the profession:

The historians who came to the meeting were intelligent, kind, and encouraging. But they didn't seem to have a good sense of how to wield quantitative data to answer questions, didn't have relevant computational skills, and didn't seem to have the time to dedicate to a big multi-author collaboration. It's not their fault: these things don't appear to be taught or encouraged in history departments right now.13

This is a serious indictment considering the pending archival shift to digital sources, and one that still largely holds true (a few history departments now offer digital history courses, but they are still outliers). This form of history, the massive analysis of digitized information, will continue. Furthermore, these approaches will attract public attention, as they lend themselves well to accessible online deployment. Unless historians are well positioned, they risk ceding professional ground to other disciplines.

While historians were not involved in Culturomics, they are, however, involved in several other digital projects. Two major and ambitious long-term Canadian projects are currently using computational methods to reconstruct the demographic past. The first is the Canadian Century Research Infrastructure project, which is creating databases from five censuses and aiming to provide "a new foundation for the study of social, economic, cultural, and political change," while the second is the Programme de recherche en démographie historique, hosted by the Université de Montréal.13 They are reconstructing the European population of Quebec in the seventeenth and eighteenth centuries, drawing heavily on parish registers. Furthermore, several projects in England have harnessed the wave of interest in "big data" to tell compelling historical stories about their pasts: the Criminal Intent project, for example, used the proceedings of the Old Bailey courts (the main criminal court in Greater London) to visualize large quantities of historical information.14 The Criminal Intent project was itself funded by the Digging into Data Challenge, which brings together multinational research teams and funding from federal-level funding agencies in Canada, the United States, the United Kingdom, and beyond. Several other historical studies have been funded under their auspices, from studies of commodity trade in the eighteenth- and nineteenth-century Atlantic world (involving scholars at the University of Edinburgh, the University of St Andrews, and York University) to data mining the degrees of economic opportunity and spatial mobility in Britain, Canada, and the United States (bringing researchers at the University of Guelph, the University of Leicester, and the University of Minnesota together).15

These exciting interdisciplinary teams may be heralding a disciplinary shift, although it is still early days.

Collaborative teams have limitations, though. We will have to be prepared to tackle digital sources alone, on shoestring budgets. Historians will not always be able to call on external programmers or specialists to do work, because teams often develop out of funding grants. As governmental and university budgets often prove unable to meet consistent research needs, baseline technical knowledge is required for independent and unfunded work. If historians rely exclusively on outside help to deal with born-digital sources, the ability to do computational history diminishes. As the success of the Programming Historian indicates, and as I will demonstrate in my case study below, historians can learn basic techniques.16 Let us tweak our own algorithms.

A social or cultural history of the very late-twentieth or twenty-first century will have to account for born-digital sources, and historians will (in conjunction with archival professionals) have to curate and make sense of the data. Historians need to be able to engage with these sources themselves, or at least as part of a team, rather than having sources mediated through the lens of outside disciplines or commercial interests. Algorithms may be one meaningful way to make sense of cultural trends or extract meaningful information from billions of tweets, but we must ensure professional historical input goes into their crafting.

The Third Wave of Computational History

Historians have fruitfully used computers in the past, as part of two previous waves. The 1960s and 1970s saw computational tools used for demographic, population, and economic histories, a trend that in Canada saw full realization in comprehensive studies such as Michael Katz's The People of Hamilton, Canada West.17 As Ian Anderson has convincingly noted in a review of history and computing, studies of this sort saw the field become associated with quantitative studies.18

After a retreat from the spotlight, computational history rose again in the 1990s with the advent of the personal computer, graphical user interfaces, and overall improvements in ease of use. Punch cards could now give way to database programs, GIS, and early online networks such as H-Net and Usenet. We are now, I believe, on the cusp of a third revolution in computational history, thanks to three main factors: decreasing storage costs, the power of the Internet and distributed cloud computing, and the rise of professionals dealing with both digital preservation and open-source tools.

Decreasing storage costs have led to massive amounts of information being preserved, as Gleick noted in this article's epigraph. It is important to note that this information overload is not new: people have long worried about the impact of too much information.19 In the sixteenth century, responding to the rise of the printing press, the German priest Martin Luther decried that the "multitude of books [were] a great evil." In the nineteenth century, Edgar Allan Poe bemoaned that "[t]he enormous multiplication of books in every branch of knowledge is one of the greatest evils of this age." As recently as 1970, American historian Lewis Mumford lamented that "the overproduction of books will bring about a state of intellectual enervation and depletion hardly to be distinguished from massive ignorance."20 The rise of born-digital sources must thus be seen in this continuous context of handwringing around the expansion and rise of information.

Useful historical information is being preserved at a rate that is accelerating with every passing day. One useful metric by which to compare dataset size is that of the American Library of Congress (LOC). Indeed, a "LOC" has become shorthand for a unit of measurement in the digital humanities. Measuring the actual extent of the printed collection in terms of data is notoriously difficult: James Gleick has claimed it to be around 10TB,22 whereas the Library of Congress itself claims a more accurate figure is about 200TB.23 While these two figures are obviously divergent, neither is on an unimaginable scale: personal computers now often ship with a terabyte or more of storage. When Claude Shannon, the father of information theory, carried out a thought experiment in 1949 about items or places that might store "information" (an emerging field which viewed messages in their aggregate), he scaled sites from a digit upwards to the largest collection he could imagine. The list scaled up logarithmically from 10 to the power of 0 to 10 to the power of 14: if a single-spaced page of typing is 10^4, 10^7 the Proceedings of the Institute of Radio Engineers, and 10^9 the entire text of the Encyclopaedia Britannica, then 10^14 was "the largest information stockpile he could think of: the Library of Congress."24

But now the Library of Congress' print collection, something that has taken two centuries to gather, is dwarfed by its newest collection of archived born-digital sources, the vast majority only a matter of years old compared to the much wider date-range of non-digital sources in the traditional collection. The LOC has begun collecting its own born-digital Internet archive: multiple snapshots are taken of webpages to show how they change over time. Even as a selective, curated collection drawing on governmental, political, educational, and creative websites, the LOC has already collected 254TB of data, and it adds 5TB a month.25 The Internet Archive, through its Wayback Machine, is even more ambitious: it seeks to archive every website on the Internet. While its size is also hard to put an exact finger on, as of late 2012 it had over 10 petabytes of information (or, to put it into perspective, a little over 50 Library of Congresses if we take the 200TB figure).26


This is, to some degree, comparing apples and oranges: the Library of Congress is predominantly print, whereas the Internet Archive has considerable multimedia holdings, which include videos, images, and audio files. The LOC is furthermore a more curated collection, whereas the Internet Archive draws from a wider range of producers. The Internet offers the advantages and disadvantages of being a more democratic archive. For example, a GeoCities site (an early way for anybody to create their own website) created by a 12-year-old Canadian in the mid-1990s might be preserved by the Internet Archive, whereas it almost certainly would not be saved by a national archive. This difference in size and collections management, however, is at the root of the changing historians' toolkit. If we make use of these large data troves, we can access a new and broad range of historical subjects.

Conventional sources are also part of this deluge. Google Books has been digitizing the printed word; it currently has 15 million works completed, and by 2020 it audaciously aims to digitize every book ever written. While copyright issues loom, the project will allow researchers to access aggregate word and phrase data, as seen in the Google n-gram project.27 In Canada, Library and Archives Canada (LAC) has a more circumscribed digitization project, collecting "a representative sample of Canadian websites," focusing particularly on Government of Canada webpages, as well as taking a curated approach preserving case studies such as the 2006 federal election. Currently, it has 4TB of federal government information and 7TB of total information in its larger web archive (a figure which, as I note below, is unlikely to drastically improve despite ostensible goals of archival 'modernization').28

In Canada, LAC has, on the surface, been advancing an ambitious digitization program, as it realizes the challenges now facing archivists and historians. Daniel J. Caron, the Librarian and Archivist of Canada, has been outspoken on this front, with several public addresses on the question of digital sources and archives. Throughout 2012, primary source digitization at LAC preoccupied critics, who saw it as a means to depreciate on-site access. The Canadian Association of University Teachers and the Canadian Historical Association, as well as other professional organizations, mounted campaigns against this shift in resources.29 LAC does have a point, however, with its modernization agenda: Canada's online collections are small and sorely lacking when compared to its on-site collections, and LAC does need to modernize and achieve the laudable goal of reaching audiences beyond Ottawa.

Nonetheless, digitization has unfortunately been used as a smokescreen to depreciate overall service offerings.30 The modernization agenda would also see 50 percent of the digitization staff cut.31 LAC's online collections are small; it does not work with online developers through Application Programming Interfaces (APIs, a way for a computer program to talk to another computer and speed up research; Canadiana.ca is, incidentally, a leader in this area); and there have been no promising indications that this state of affairs will change. Digitization has not been comprehensive. Online preservation is important, and historians must fight for both on-site and online access.

A number of historians have recognized this challenge and are playing an instrumental role in preserving the born-digital past. The Roy Rosenzweig Center for History and New Media at George Mason University launched several high-profile preservation projects: the September 11th Digital Archive, the Hurricane Digital Memory Bank (focusing on Hurricanes Katrina and Rita), and, as of writing, the Occupy archive.32 Massive amounts of online content are curated and preserved: photographs, news reports, blog posts, and now tweets. These complement more traditional efforts of collecting and preserving oral histories and personal recollections, which are then geo-tagged (tied to a particular geographical location), transcribed, and placed online. These archives serve a dual purpose: preserving the past for future generations while also facilitating easy dissemination for today's audiences.

Digital preservation, however, is a topic of critical importance. Much of our early digital heritage has been lost. Many websites, especially those before the Internet Archive's 1996 web archiving project launch, are completely gone. Early storage mediums (from tape to early floppy disks) have become obsolete and are now nearly inaccessible. Compounding this, software development has seen early proprietary file formats depreciated, ignored, and eventually made wholly inaccessible.33 Fortunately, the problem of digital preservation was recognized by the 1990s, and in 2000 the American government established the National Digital Information Infrastructure and Preservation Program.34 As early as 1998, digital preservationists argued: "[h]istorians will look back on this era … and see a period of very little information. A 'digital gap' will span from the beginning of the wide-spread use of the computer until the time we eventually solve this problem. What we're all trying to do is to shorten that gap." To bear home their successes, that quotation is taken from an Internet Archive snapshot of Wired magazine's website.35

Indeed, greater attention towards rigorous metadata (which is, briefly put, data about data, often preserved in plain text format), open-access file formats documented by the International Organization for Standardization (ISO), and dedicated digital preservation librarians and archivists gives us hope for the future.36

It is my belief that cloud storage, the online hosting of data in large third-party offsite centres, mitigates the issue of obsolete storage mediums, and that the move towards open-source file formats, even amongst large commercial enterprises such as Microsoft, provides greater documentation and uniformity.


A Brief Case Study: Accessing and Navigating Canada's Digital Collections, with an Emphasis on Tools for Born-Digital Sources

What can we do about this digital deluge? There are no simple answers, but historians must begin to conceptualize new additions to their traditional research and pedagogical toolkits. In the section that follows, I will do two things: introduce a case study, and introduce several tools, both proprietary and those that are open and available off-the-shelf, to reveal one way forward.

For a case study, I have elected to use Canada's Digital Collections (CDC), a collection of 'dead' websites removed from their original location on the World Wide Web and now digitally archived at LAC. In this section, I will introduce the collection using digitally-archived sources from the early World Wide Web and discuss my research methodology, before illustrating how we can derive meaningful data from massive arrays of unstructured data.37

This is a case study of a web archive. Several of these methodologies could be applied to other formats, including large arrays of textual material, whether born-digital or not. A distant reading of this type is not medium specific. It does, however, assume that the text is high quality. Digitized primary documents do not always have text layers, and if they are instituted computationally through Optical Character Recognition, errors can appear. Accuracy rates for digitized textual documents can range from as low as 40 percent for seventeenth- and eighteenth-century documents, to 80–90 percent for modern microfilmed newspapers, to over 99 percent for typeset, word-processed documents.38 That said, while historians today may be more interested in applying distant reading methodologies to more traditional primary source materials, it is my belief that in the not-so-distant future, websites will be an increasingly significant source for social, cultural, and political historians.

What are Canada's Digital Collections? In 1996, Industry Canada, funded by the federal Youth Employment Strategy, facilitated the development of Canadian heritage websites by youth (defined as anyone between the ages of 15 and 30). It issued contracts to private corporations that agreed to use these individuals as web developers, for the twofold purpose of developing digital skills and instilling respect for and engagement with Canadian heritage and preservation.39 The collection, initially called SchoolNet but rebranded as Canada's Digital Collections in 1999, was accessible from the CDC homepage. After 2004, the program was shut down and the vast majority of the websites were archived on LAC's webpage. This collection is now the largest born-digital Internet archive that LAC holds, worthy of study as content as well as for thinking about methodological approaches.

Conventional records about the program are unavailable in the traditional LAC collection, unsurprising in light of both its recent creation and backlogs in processing and accessioning archival materials. We can use one element of our born-digital toolkit, the Internet Archive's WaybackMachine (its archive of preserved early webpages), to get a sense of what this project looked like in its heyday. The main webpage for the CDC was www.collections.ic.gc.ca. However, a visitor today will see nothing but a collection of broken image links (two of them will bring you to the English and French-language versions of the LAC-hosted web archive, although this is only discernible through trial and error, or through the site's source code).

With the WaybackMachine, we have a powerful, albeit limited, opportunity to "go back in time" and experience the website as previous web viewers would have (if a preserved website is available, of course). Among the over 420 million archived websites, we can search for and find the original CDC site. Between 12 October 1999 and the most recent snapshot, or web crawl (of the now-broken site), on 6 July 2011, the site was crawled 410 times.40 Unfortunately, every site version until the 15 October 2000 crawl has been somehow corrupted: content has been instead replaced by Chinese characters, which translate only into place names and other random words. Issues of early digital preservation, noted above, have reared their head in this collection.

From late 2000 onward, however, fully preserved CDC sites are accessible. Featured sites, new entries, success stories, awards, curricular units, and prominent invitations to apply for project funding populate the brightly coloured website. The excitement is palpable. John Manley, then Industry Minister, declared: "it is crucial that Canada's young people get the skills and experience they need to participate in the knowledge-based economy of the future," and that CDC "shows that young people facing significant employment challenges can contribute to and learn from the resources available on the Information Highway."41 The documentation, reflecting the lower level of technical expertise and scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as well as preservation issues, hardware requirements for computers (a computer with an Intel 386 processor, 8MB RAM, a 1GB hard drive, and a 28.8bps modem was recommended), and training information for employers, educators, and participants.42 Sample proposals and community liaison ideas, amidst other information, set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites offers an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how viewers discovered the site and whether they think it is a good educational resource (they must have been relieved that the majority thought so: in 2003, some 84 percent felt that it was a "good" resource, the best of its type).43 By 2004, the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, alphabetical listing, success stories, and detailed copyright information.

The Canada's Digital Collections site began to move towards removal in November 2006, and initially it was completely gone. A visitor to the site would learn that the content "is no longer available," and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007, a notice appeared with a declaration that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained then, and today for that matter, is a straightforward alphabetical list and a disclaimer noting: "[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost."

The question now is: what can we do with this collection of websites? Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use to navigate large amounts of data. First, I create my own tools through computer programming; this work is showcased in the following section. I program in Wolfram Research's Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base that complements exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website in perfectly readable plaintext (avoiding unicode errors), as follows: input = Import["http://activehistory.ca", "Plaintext"], and a second can make it all lower case for textual analysis, as such: lowerpage = ToLowerCase[input]. Variables are a critical building block. In the former example, "input" now holds the entire plaintext of the website ActiveHistory.ca, and "lowerpage" then holds it in lowercase. Once we have lower-cased information, we are quickly able to extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
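
To continue that example with functions available in Mathematica 9, the sort of frequent-word list used later in this case study can be generated in two further lines (lowerpage carries over from the snippet above):

words = StringSplit[lowerpage, Except[WordCharacter] ..];  (* tokenize on any non-word characters *)
Take[Reverse[SortBy[Tally[words], Last]], 20]  (* the twenty most frequent words and their counts *)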

Even more promising is Mathematica's integration with the Wolfram|Alpha database (itself accessible at wolframalpha.com), a computational knowledge engine that promises to put incredible amounts of information at a user's fingertips. With a quick command, one can access the temperature at Toronto's Lester B. Pearson International Airport at 5:00 pm on 18 December 1983, the relative popularity of names such as "Edith" or "Emma," the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information or find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.
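
Within a notebook, such a look-up is a one-line call to the built-in WolframAlpha function; the free-form query strings below are merely illustrative:

WolframAlpha["Canada unemployment rate March 1985"]
WolframAlpha["temperature at Toronto Pearson airport at 5pm on 18 December 1983"]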

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be-all and end-all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools.47 It is a powerful suite of textual analysis tools. Uploading one's own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a "word cloud" of a corpus; click on any word to see its relative rise and fall over time (word trends); access the original text; and click on any word to see its contextual placement within individual sentences. It ably facilitates the "distant" reading of any reasonably-sized body of textual information and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. Many visualization options are provided: you can plot relationships between data, compare values (i.e., bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as to IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or, in a best case scenario, hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" that appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.
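
To give a flavour of the MALLET end of this workflow, a topic-modeling run from MALLET's own command-line interface, independent of SEASR, follows this pattern (the directory and file names here are hypothetical):

bin/mallet import-dir --input cdc-text/ --output cdc.mallet --keep-sequence --remove-stopwords
bin/mallet train-topics --input cdc.mallet --num-topics 20 --output-topic-keys cdc-topic-keys.txt

The second command writes a plain text file listing, for each of the twenty topics, the words that most strongly define it.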

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2, or on my own blog at ianmilligan.ca. I have elected to leave the commands online rather than provide them in this article: they require some technical knowledge, and working directly from the web is quicker. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian, dealing with conventional sources, often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command line entry brought the entire archive onto my system; it took days to process the information, and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, amounts to approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay, or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations: the information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, with 360 MB of plaintext with HTML encoding, reduced to a 'mere' 50MB if all encoding is removed. A 50MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the Internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with the material. Before you do this, though, you need to be familiar with the website's "terms of service," the legal documentation that outlines how users are allowed to use a website, its resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced, in part or in whole, and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so the index islocated at wwwepelac-bacgcca100205301iccdcEAlphabetaspwith every subsequent website branching off the cdc folder Forexample one website would be hosted in wwwepelac-bacgcca100205301iccdcABbenevolat the next in hellipcdcabnatureand so on Using wget a free software program you can downloadall websites in a certain structure the program recursively downloadsinformation in a methodical fashion I first used wget to save the


I first used wget to save the cookies (little files kept on your system, in this case one noting that you had seen the disclaimer about the archival nature of the website) and then, with a one-line command in my command line, to download the entire database.52 Due to the imposition of a random wait and bandwidth limit, which meant that I would download one file every half to one and a half seconds at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.
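
The full command appears in note 52; it took the following shape (the exact archived paths are a best guess, as above):

    wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt \
         --referrer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm \
         --random-wait --wait=1 --limit-rate=30K \
         http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp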

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these connections must often be inferred. Online, HTML born-digital sources are different in that the link offers a very concrete file.


Visualizing the CDC Collection as a Series of Hyperlinks

One quick visualization, the circular cluster seen here, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As the figure shows, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The site is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and relationships between different sources. At a glimpse, we can see suggestions to follow other sites and determine if any nodes are shared amongst the various websites revealed.
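
The underlying code is not reproduced in this article; a minimal Wolfram Language sketch of the same idea, assuming the archived index page as a seed URL, might look like this:

    (* Scrape outbound links from one archived CDC page and draw them as a
       graph; "Hyperlinks" is a built-in Import element. Mousing over a
       vertex shows its URL, echoing the dynamic model described above. *)
    seed = "http://www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp";
    links = Import[seed, "Hyperlinks"];
    Graph[Thread[seed -> links], VertexLabels -> Placed["Name", Tooltip]]

A fuller version would repeat the scrape for each harvested page (to the two-link depth described above) and merge the resulting edge lists before graphing.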

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian who is primarily interested in the written word, my first step was to isolate the plaintext throughout the material so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels: first, the aggregate collection; and second, the collection divided into each 'file,' or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words appear within the nearly eight million words of the collection? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader and is of minimal visual attractiveness. Two quick decisions were made to enhance data usability.


First, the information was "case-folded," all sent to lower-case; this enables us to count the total appearances of a word such as "digital" regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency (as opposed to phrase frequency), "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is comprised entirely of stop words) but will quickly dominate any word-level visualization or list.53 Other words may need to be added or removed from stop lists depending on the corpus: for example, if the phrase "page number" appeared on every page, that would be useless information for the researcher.
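
A minimal Wolfram Language sketch of this frequency pipeline, assuming the collection's plaintext has been gathered into a single file (the filename and the tiny bilingual stop list are illustrative only):

    (* Case-fold, split into words, drop stop words, and rank by frequency. *)
    text = ToLowerCase[Import["cdc-plaintext.txt", "Text"]];
    words = StringSplit[text, Except[WordCharacter] ..];
    stop = {"the", "and", "to", "of", "a", "in", "is", "la", "de", "les"};
    ranked = Reverse @ SortBy[Tally[DeleteCases[words, Alternatives @@ stop]], Last];
    Take[ranked, 25]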

The first step, then, is to use Mathematica to format word frequency information into Wordle.net readable format (if we put all of the text into that website, it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds.


A Word Cloud of All Words in the CDC Corpus

What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that the collection deals with Canada/Canadians, that it is a collection with documents in French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.
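
(Later Mathematica releases can skip the Wordle step entirely; WordCloud has been built in since version 10.1, postdating the workflow described above. A one-liner, assuming the ranked list from the earlier sketch:)

    (* Draw a cloud of the 150 most frequent words from {word, count} pairs. *)
    WordCloud[Rule @@@ Take[ranked, 150]]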

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, they leave a severe amount of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture. Does the word "school" refer to built structures or educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but need to be used with extreme caution. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present." Despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters (such as 'bi'), two syllables, or, pertinent in our case, two words; an example is "canada is." A trigram is three words, a quadgram is four words, a fivegram is five words, and so on.
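
Word-level n-grams are simple to derive with overlapping partitions. A sketch along these lines, reusing `words` from the earlier snippet (my actual program is not reproduced here):

    (* Overlapping word-level n-grams, joined back into phrases. *)
    ngrams[w_List, n_Integer] := StringJoin[Riffle[#, " "]] & /@ Partition[w, n, 1];
    Take[Reverse @ SortBy[Tally[ngrams[words, 3]], Last], 15]   (* top trigrams *)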


Word Cloud of the File ldquoPast and Presentrdquo

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

"alberta", 758; "listen", 490; "settlers", 467; "new", 454; "canada", 388; "land", 362; "read", 328; "people", 295; "settlement", 276; "chinese", 268; "canadian", 264; "place", 255; "heritage", 254; "edmonton", 250; "years", 238; "west", 236; "raymond", 218; "ukrainian", 216; "came", 213; "calgary", 209; "names", 206; "ranch", 204; "updated", 193; "history", 190; "communities", 183; …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example in this automatically generated output of top phrases:

"first" "people" "settlers", 100; "the" "united" "states", 72; "one" "of" "the", 59; "of" "the" "west", 58; "settlers" "last" "updated", 56; "people" "settlers" "last", 56; "albertans" "first" "people", 56; "adventurous" "albertans" "first", 56; "the" "bar" "u", 55; "opening" "of" "the", 54; "communities" "adventurous" "albertans", 54; "new" "communities" "adventurous", 54; "bar" "u" "ranch", 53; "listen" "to" "the", 52; …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to quickly hone in on immediately relevant information.


This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stopword" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

"experience" "in" "the", 919; "transferred" "to" "the", 918; "funded" "by" "the", 918; "of" "the" "provincial", 915; "this" "web" "site", 915; "409" "registry" "2", 914; "are" "no" "longer", 914; "gouvernement" "du" "canada", 910; "use" "this" "link", 887; "page" "or" "you", 887; "following" "page" "or", 887; "automatically" "transferred" "to", 887; "be" "automatically" "transferred", 887; "will" "be" "automatically", 887; "industry" "you" "will", 887; "multimedia" "industry" "you", 887; "practical" "experience" "in", 887; "gained" "practical" "experience", 887; "they" "gained" "practical", 887; "while" "they" "gained", 887; "canadians" "while" "they", 887; "young" "canadians" "while", 887; "showcase" "of" "work", 887; "excellent" "showcase" "of", 887; …

Here we see an array of common language: funding information, link information, and project information, spread across over 887 websites.


Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we move into the unique or very rare phrases that appear only once, twice, or three times.

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time. … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.
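
If MALLET is run directly, the conventional sequence is to import a directory of text files and then train the topics. Shelled out from Mathematica, it might look like the following (the paths, file names, and topic count are assumptions, not the settings actually used here):

    (* Import a directory of plain-text files, then train 20 topics.
       RunProcess simply invokes MALLET's standard command-line tools. *)
    RunProcess[{"bin/mallet", "import-dir", "--input", "cdc-text",
      "--output", "cdc.mallet", "--keep-sequence", "--remove-stopwords"}];
    RunProcess[{"bin/mallet", "train-topics", "--input", "cdc.mallet",
      "--num-topics", "20", "--output-topic-keys", "cdc-keys.txt",
      "--output-doc-topics", "cdc-topics.txt"}];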


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes); the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language); the stories (story, knight, archives, but also, interestingly, knight and wife); and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works. It is a proprietary black box: while the original algorithm is available, the actual methods that go into determining the ranking of search results are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
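
The tool itself is not printed here. A rough Wolfram Language equivalent, assuming one plain-text file per CDC site in a directory (the directory name is invented), could chart raw counts per file:

    terms = {"freedom", "black", "liberty"};
    files = FileNames["*.txt", "cdc-plaintext"];
    (* Crude substring counting: "black" would also match "blackboard". *)
    count[f_, t_] := StringCount[ToLowerCase[Import[f, "Text"]], t];
    (* One bar group per term; tooltips name the file, as in the dynamic chart. *)
    BarChart[Table[Tooltip[count[f, t], FileNameTake[f]], {t, terms}, {f, files}],
      ChartLegends -> terms]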

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.


A snapshot of the dynamic finding aid tool
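
A rough sketch of the keyword-in-context idea (not the routine used here), assuming the site's plaintext sits in one file with an invented name, slices a window of characters around each hit:

    (* Show each occurrence of a term with 40 characters of context. *)
    kwic[text_String, term_String, n_: 40] :=
      With[{len = StringLength[text]},
       StringTake[text, {Max[1, First[#] - n], Min[len, Last[#] + n]}] & /@
        StringPosition[ToLowerCase[text], ToLowerCase[term]]];
    Column[kwic[Import["blackgold.txt", "Text"], "ontario"]]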

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly identify critical sources.

Another model through which we can visualize documents is through a viewer primarily focusing on term frequency and inverse document frequency, terms discussed below. To prepare this material, we combine our earlier data about word frequency (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e., the word "experience" appears 1,766 times). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf times the logarithm of the number of documents divided by the number of documents where the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."
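
As a one-line function (the document frequency below is invented for illustration; 384 is the number of sites noted earlier):

    (* tf-idf = tf * log(N / df), with N total documents and df the
       number of documents containing the term. *)
    tfidf[tf_, df_, nDocs_] := tf Log[nDocs/df];
    tfidf[919, 87, 384] // N   (* "experience" in file 267, assuming df = 87 *)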


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5 MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56


This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance rather than processed on the spot, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need rears itself, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin or jumpstart a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts).


Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and perhaps, subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. This should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web, whether through blogging, micro-blogging, publishing webpages, and so forth; it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios.


Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history and the readiness to work with born-digital sources cannot stem out of pedagogy alone, however; institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary, and second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site


Internet ActiveHistory.ca ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates/works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What


a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.


14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients – 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to learn the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp, Greater Toronto Area, Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook: Penguin, 2010. For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2, The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online: www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can of course vary depending on compression methods chosen. Their math was based on an average of 8 MB per book, if it was scanned at 600 dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation


Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82, published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2


(Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Center for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Center for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: Temple University Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive: www.webarchive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83;


and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in ProQuest Historical Newspapers Collections and Infotrac: The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online: www.waybackarchive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online: www.webarchive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collections Webpage, 24 November 2000, through Internet Archive: www.webarchive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections Homepage," 21 November 2003, through Internet Archive: www.webarchive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important News About Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive: www.webarchive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see the information at "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 p.m. on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and now "Emma" is the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.


48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 "Important Notices," Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. They can usually be found on index or splash pages under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referrer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previous cookie load, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and bandwidth cap to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8, available online at www.nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated late-2007 MacBook Pro.

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 "SHARCNET: History of SHARCNET," SHARCNET, 2009,


www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.


Page 10: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

reconstructing the European population of Quebec in the seven-teenth and eighteenth centuries drawing heavily on parish registersFurthermore several projects in England have harnessed the wave ofinterest in ldquobig datardquo to tell compelling historical stories about theirpasts the Criminal Intent project for example used the proceedingsof the Old Bailey courts (the main criminal court in Greater London)to visualize large quantities of historical information14 The CriminalIntent project was itself funded by the Digging into Data Challengewhich brings together multinational research teams and fundingfrom federal-level funding agencies in Canada the United States theUnited Kingdom and beyond Several other historical studies havebeen funded under their auspices from studies of commodity tradein the eighteenth- and nineteenth-century Atlantic world (involvingscholars at the University of Edinburgh the University of StAndrews and York University) to data mining the degrees of eco-nomic opportunity and spatial mobility in Britain Canada and theUnited States (bringing researchers at the University of Guelph theUniversity of Leicester and the University of Minnesota together)15

These exciting interdisciplinary teams may be heralding a disciplinalshift although it is still early days

Collaborative teams have limitations though We will have tobe prepared to tackle digital sources alone on shoestring budgetsHistorians will not always be able to call on external programmers orspecialists to do work because teams often develop out of fundinggrants As governmental and university budgets often prove unableto meet consistent research needs baseline technical knowledge isrequired for independent and unfunded work If historians relyexclusively on outside help to deal with born-digital sources the abil-ity to do computational history diminishes As the success of theProgramming Historian indicates and as I will demonstrate in mycase study below historians can learn basic techniques16 Let ustweak our own algorithms

A social or cultural history of the very late-twentieth or twenty-first century will have to account for born-digital sources andhistorians will (in conjunction with archival professionals) have tocurate and make sense of the data Historians need to be able toengage with these sources themselves or at least as part of a team

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

rather than having sources mediated through the lens of outside dis-ciplines or commercial interests Algorithms may be one meaningfulway to make sense of cultural trends or extract meaningful informa-tion from billions of tweets but we must ensure professionalhistorical input goes into their crafting

The Third Wave of Computational History

Historians have fruitfully used computers in the past as part of twoprevious waves The 1960s and 1970s saw computational tools usedfor demographic population and economic histories a trend that inCanada saw full realization in comprehensive studies such as MichaelKatzrsquos The People of Hamilton Canada West17 As Ian Anderson hasconvincingly noted in a review of history and computing studies ofthis sort saw the field become associated with quantitative studies18

After a retreat from the spotlight computational history rose again inthe 1990s with the advent of the personal computer graphical userinterfaces and overall improvements in ease of use Punch cards couldnow give way to database programs GIS and early online networkssuch as H-Net and Usenet We are now I believe on the cusp of athird revolution in computational history thanks to three main fac-tors decreasing storage costs the power of the Internet anddistributed cloud computing and the rise of professionals dealingwith both digital preservation and open-source tools

Decreasing storage costs have led to massive amounts of infor-mation being preserved as Gleick noted in this articlersquos epigraph Itis important to note that this information overload is not newPeople have long worried about the impact of too much informa-tion19 In the sixteenth century responding to the rise of the printingpress the German priest Martin Luther decried that the ldquomultitudeof books [were] a great evilrdquo In the nineteenth century Edgar AllanPoe bemoaned ldquo[t]he enormous multiplication of books in everybranch of knowledge is one of the greatest evils of this agerdquo Asrecently as 1970 American historian Lewis Mumford lamented ldquotheoverproduction of books will bring about a state of intellectual ener-vation and depletion hardly to be distinguished from massiveignorancerdquo20 The rise of born-digital sources must thus be seen in

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

this continuous context of handwringing around the expansion andrise of information

Useful historical information is being preserved at a rate that isaccelerating with every passing day One useful metric to comparedataset size is that of the American Library of Congress (LOC)Indeed a ldquoLOCrdquo has become shorthand for a unit of measurementin the digital humanities Measuring the actual extant of the printedcollection in terms of data is notoriously difficult James Gleick hasclaimed it to be around 10TB22 whereas the Library of Congressitself claims a more accurate figure is about 200TB23 While thesetwo figures are obviously divergent neither is on an unimaginablescale mdash personal computers now often ship with a terabyte or moreof storage When Claude Shannon the father of information theorycarried out a thought experiment in 1949 about items or places thatmight store ldquoinformationrdquo (an emerging field which viewed messagesin their aggregate) he scaled sites from a digit upwards to the largestcollection he could imagine The list scaled up logarithmically from10 to the power of 0 to 10 to the power of 13 If a single-spaced pageof typing is 104 107 the Proceedings of the Institute of Radio Engineers109 the entire text of the Encyclopaedia Brittanica then 1014 was ldquothelargest information stockpile he could think of the Library ofCongressrdquo24

But now the Library of Congressrsquo print collection mdash somethingthat has taken two centuries to gather mdash is dwarfed by their newestcollection of archived born-digital sources the vast majority only amatter of years old compared to the much wider date-range of non-digital sources in the traditional collection The LOC has beguncollecting its own born-digital Internet archive multiple snapshotsare taken of webpages to show how they change over time Even asa selective curated collection drawing on governmental politicaleducational and creative websites the LOC has already collected254TB of data and adds 5TB a month25 The Internet Archivethrough its Wayback Machine is even more ambitious It seeks toarchive every website on the Internet While its size is also hard toput an exact finger on as of late 2012 it had over 10 Petabytes ofinformation (or to put it into perspective a little over 50 Library ofCongresses if we take the 200TB figure)26

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

This is to some degree comparing apples and oranges theLibrary of Congress is predominantly print whereas the InternetArchive has considerable multimedia holdings which includesvideos images and audio files The LOC is furthermore a morecurated collection whereas the Internet Archive draws from a widerrange of producers The Internet offers the advantages and disadvan-tages of being a more democratic archive For example a Geocitiessite (an early way for anybody to create their own website) created bya 12-year old Canadian teenager in the mid-1990s might be pre-served by the Internet Archive whereas it almost certainly would notbe saved by a national archive This difference in size and collectionsmanagement however is at the root of the changing historiansrsquotoolkit If we make use of these large data troves we can access a newand broad range of historical subjects

Conventional sources are also part of this deluge Google Bookshas been digitizing the printed word and it currently has 15 millionworks completed by 2020 it audaciously aims to digitize every bookever written While copyright issues loom the project will allowresearchers to access aggregate word and phrase data as seen in theGoogle n-gram project27 In Canada Library and Archives Canada(LAC) has a more circumscribed digitization project collecting ldquoarepresentative sample of Canadian websitesrdquo focusing particularlyon Government of Canada webpages as well as a curated approach

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

preserving case studies such as the 2006 federal election Currentlyit has 4TB of federal government information and 7TB of totalinformation in its larger web archive (a figure which as I note belowis unlikely to drastically improve despite ostensible goals of archivallsquomodernizationrsquo)28

In Canada LAC has mdash on the surface mdash has been advancingan ambitious digitization program as it realizes the challenges nowfacing archivists and historians Daniel J Caron the Librarian andArchivist of Canada has been outspoken on this front with severalpublic addresses on the question of digital sources and archivesThroughout 2012 primary source digitization at LAC preoccupiedcritics who saw it as a means to depreciate on-site access TheCanadian Association of University Teachers the Canadian HistoricalAssociation as well as other professional organizations mountedcampaigns against this shift in resources29 LAC does have a pointhowever with their modernization agenda The state of Canadarsquosonline collections are small and sorely lacking when compared totheir on-site collections and LAC does need to modernize and achievethe laudable goal of reaching audiences beyond Ottawa

Nonetheless digitization has unfortunately been used as asmokescreen to depreciate overall service offerings30 The modern-ization agenda would also see 50 percent of the digitization staffcut31 LACrsquos online collections are small they do not work withonline developers through Application Programming Interfaces(APIs a way for a computer program to talk to another computerand speed up research mdash Canadianaca is incidentally a leader in thisarea) and there have been no promising indications that this state ofaffairs will change Digitization has not been comprehensive Onlinepreservation is important and historians must fight for both on-siteand on-line access

A number of historians have recognized this challenge and areplaying an instrumental role in preserving the born-digital past TheRoy Rosenzweig Center for History and New Media at GeorgeMason University launched several high-profile preservation pro-jects the September 11th Digital Archive the Hurricane DigitalMemory Bank (focusing on Hurricanes Katrina and Rita) and as ofwriting the Occupy archive32 Massive amounts of online content

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

is curated and preserved photographs news reports blog posts andnow tweets These complement more traditional efforts of collectingand preserving oral histories and personal recollections which arethen geo-tagged (tied to a particular geographical location) tran-scribed and placed online These archives serve a dual purposepreserving the past for future generations while also facilitating easydissemination for todayrsquos audiences

Digital preservation, however, is a topic of critical importance. Much of our early digital heritage has been lost. Many websites, especially those predating the Internet Archive's 1996 web archiving project launch, are completely gone. Early storage media (from tape to early floppy disks) have become obsolete and are now nearly inaccessible. Compounding this, software development has seen early proprietary file formats depreciated, ignored, and eventually made wholly inaccessible.33 Fortunately, the problem of digital preservation was recognized by the 1990s, and in 2000 the American government established the National Digital Information Infrastructure and Preservation Program.34 As early as 1998, digital preservationists argued: "[h]istorians will look back on this era ... and see a period of very little information. A 'digital gap' will span from the beginning of the wide-spread use of the computer until the time we eventually solve this problem. What we're all trying to do is to shorten that gap." To bear home their successes, that quotation is taken from an Internet Archive snapshot of Wired magazine's website.35

Indeed, greater attention towards rigorous metadata (which is, briefly put, data about data, often preserved in plain text format), open-access file formats documented by the International Organization for Standardization (ISO), and dedicated digital preservation librarians and archivists gives us hope for the future.36

It is my belief that cloud storage, the online hosting of data in large third-party offsite centres, mitigates the issue of obsolete storage media, and that the move towards open-source file formats, even amongst large commercial enterprises such as Microsoft, provides greater documentation and uniformity.

A Brief Case Study: Accessing and Navigating Canada's Digital Collections, With an Emphasis on Tools for Born-Digital Sources

What can we do about this digital deluge? There are no simple answers, but historians must begin to conceptualize new additions to their traditional research and pedagogical toolkits. In the section that follows, I will do two things: introduce a case study, and introduce several tools, both proprietary and those that are open and available off the shelf, to reveal one way forward.

For a case study, I have elected to use Canada's Digital Collections (CDC), a collection of 'dead' websites removed from their original location on the World Wide Web and now digitally archived at LAC. In this section I will introduce the collection, using digitally-archived sources from the early World Wide Web, and discuss my research methodology before illustrating how we can derive meaningful data from massive arrays of unstructured data.37

This is a case study of a web archive, but several of these methodologies could be applied to other formats, including large arrays of textual material, whether born-digital or not. A distant reading of this type is not medium specific. It does, however, assume that the text is high quality. Digitized primary documents do not always have text layers, and if text layers are generated computationally through Optical Character Recognition (OCR), errors can appear. Accuracy rates for digitized textual documents can range from as low as 40 percent for seventeenth- and eighteenth-century documents, to 80-90 percent for modern microfilmed newspapers, to over 99 percent for typeset, word-processed documents.38 That said, while historians today may be more interested in applying distant reading methodologies to more traditional primary source materials, it is my belief that in the not-so-distant future websites will be an increasingly significant source for social, cultural, and political historians.

What are Canada's Digital Collections? In 1996, Industry Canada, funded by the federal Youth Employment Strategy, facilitated the development of Canadian heritage websites by youth (defined as anyone between the ages of 15 and 30). It issued contracts to private corporations that agreed to use these individuals as web developers, for the twofold purpose of developing digital skills and instilling respect for and engagement with Canadian heritage and preservation.39 The collection, initially called SchoolNet but rebranded as Canada's Digital Collections in 1999, was accessible from the CDC homepage. After 2004 the program was shut down, and the vast majority of the websites were archived on LAC's webpage. This collection is now the largest born-digital Internet archive that LAC holds, worthy of study both for its content and for thinking about methodological approaches.

Conventional records about the program are unavailable in the traditional LAC collection, unsurprising in light of both its recent creation and backlogs in processing and accessioning archival materials. We can instead use one element of our born-digital toolkit, the Internet Archive's WaybackMachine (its archive of preserved early webpages), to get a sense of what this project looked like in its heyday. The main webpage for the CDC was www.collections.ic.gc.ca. A visitor today, however, will see nothing but a collection of broken image links (two of them will bring you to the English and French-language versions of the LAC-hosted web archive, although this is only discernible through trial and error or through the site's source code).

With the WaybackMachine we have a powerful, albeit limited, opportunity to "go back in time" and experience a website as previous web viewers would have (if a preserved version is available, of course). Among the over 420 million archived websites, we can search for and find the original CDC site. Between 12 October 1999 and the most recent snapshot, or web crawl, of the now-broken site on 6 July 2011, the site was crawled 410 times.40 Unfortunately, every site version until the 15 October 2000 crawl has been somehow corrupted: content has instead been replaced by Chinese characters, which translate only into place names and other random words. The issues of early digital preservation noted above have reared their head in this collection.

From late 2000 onward, however, fully preserved CDC sites are accessible. Featured sites, new entries, success stories, awards, curricular units, and prominent invitations to apply for project funding populate the brightly coloured website. The excitement is palpable. John Manley, then Industry Minister, declared that "it is crucial that Canada's young people get the skills and experience they need to participate in the knowledge-based economy of the future," and that CDC "shows that young people facing significant employment challenges can contribute to and learn from the resources available on the Information Highway."41 The documentation, reflecting the lower level of technical expertise and scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as were preservation issues, hardware requirements for computers (a computer with an Intel 386 processor, 8 MB of RAM, a 1 GB hard drive, and a 28.8 kbps modem was recommended), and training information for employers, educators, and participants.42 Sample proposals, community liaison ideas, and other information set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites provides an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how viewers discovered the site and whether they think it is a good educational resource (the administrators must have been relieved that the majority thought so: in 2003, some 84 percent felt that it was a "good" resource, the best of its type).43 By 2004 the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, alphabetical listing, success stories, and detailed copyright information.

The Canada's Digital Collections site began to move towards removal in November 2006, and at first the content was completely gone. A visitor to the site would learn that the content "is no longer available," and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007 a notice appeared declaring that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained, then and today for that matter, is a straightforward alphabetical list and a disclaimer noting that "[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost."

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use below. First, I create my own tools through computer programming; this work is showcased in the following section. I program in Wolfram Research's Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base, which complements exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website in perfectly readable plaintext (avoiding unicode errors), as follows: input = Import["http://activehistory.ca", "Plaintext"], and a second can make it all lower case for textual analysis: lowerpage = ToLowerCase[input]. Variables are a critical building block: in the former example, "input" now holds the entire plaintext of the website ActiveHistory.ca, and "lowerpage" then holds it in lower case. Once we have lower-cased information, we can quickly extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
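To give a fuller sense of the workflow, those two lines extend naturally into a crude word-frequency counter. The sketch below is my own illustrative example, assuming only Mathematica's built-in functions:

input = Import["http://activehistory.ca", "Plaintext"];
lowerpage = ToLowerCase[input];
(* split on runs of non-word characters, dropping any empty strings *)
words = DeleteCases[StringSplit[lowerpage, Except[WordCharacter] ..], ""];
(* tally each word and sort from most to least frequent *)
wordcounts = Reverse[SortBy[Tally[words], Last]];
TableForm[Take[wordcounts, 20]]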

Even more promising is Mathematica's integration with the Wolfram|Alpha database (itself accessible at wolframalpha.com), a computational knowledge engine that promises to put incredible amounts of information at a user's fingertips. With a quick command, one can access the temperature at Toronto's Lester B. Pearson International Airport at 5:00 pm on 18 December 1983, the relative popularity of names such as "Edith" or "Emma," the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information or find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.

For many researchers the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be-all and end-all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools, a powerful suite of textual analysis tools.47 Uploading one's own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a "word cloud" of a corpus; word trends (click on any word to see its relative rise and fall over time); and access to the original text, where clicking on any word shows its contextual placement within individual sentences. It ably facilitates the "distant" reading of any reasonably-sized body of textual information and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis tool rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. Many visualization options are provided: you can plot relationships between data, compare values (i.e., bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as to IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or, in a best-case scenario, hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" that appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpora that were previously too big to extract meaningful information from can now be explored relatively quickly.

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are covered in more technical terms in the peer-reviewed Programming Historian 2 or on my own blog at ianmilligan.ca. I have elected to leave the full instructions online rather than provide every command in this article: they require some technical knowledge, and working directly from the web is quicker. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian, dealing with conventional sources, often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command-line entry brought the entire archive onto my system; it took days to process the information, and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay, or 1,000 dissertations. It is, however, even more complicated than just reading 1,000 dissertations: the information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, with 360 MB of plaintext with HTML encoding, reduced to a 'mere' 50 MB if all encoding is removed. A 50 MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the Internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with the material. Before you do this, though, you need to be familiar with the website's "terms of service," the legal documentation that outlines how users are allowed to use a website, its resources, or its software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced, in part or in whole and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so: the index is located at www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in .../ic/cdc/ABbenevolat, the next in .../cdc/abnature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system; in this case, one noting that you had seen the disclaimer about the archival nature of the website), and then used a one-line command to download the entire database.52 Due to the imposition of a random wait and a bandwidth limit, which meant that I would download one file every half to one and a half seconds, at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.
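For readers who want a concrete starting point, a command along the following lines reproduces the behaviour described above. It is a sketch rather than my exact command, and the archive path and cookie file name are assumptions; --wait=1 --random-wait produces pauses of roughly half a second to one and a half seconds between files, and --limit-rate caps bandwidth at 30 kilobytes a second:

wget -r --no-parent --wait=1 --random-wait --limit-rate=30k \
  --load-cookies cookies.txt \
  "http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/"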

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, those connections must often be inferred; HTML born-digital sources are different in that the link offers a very concrete file. One quick visualization, the circular cluster seen in the figure, lets us see the unique nature of this information.

Visualizing the CDC Collection as a Series of Hyperlinks

We can go to each web page, scrape the links, and see where they take viewers. As the figure shows, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The crawl is limited to following two links beyond the first page. Through this approach we see the interconnection of knowledge and the relationships between different sources. At a glimpse we can see suggestions to follow other sites and determine if any nodes are shared amongst the various websites revealed.
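The scraping itself can be done with Mathematica's "Hyperlinks" import element. The following sketch is illustrative only (the index URL is the reconstructed address given above): it gathers the links from a single page and plots them as a star, with tooltips supplying each destination. Repeating the process for every harvested page, two links deep, produces the cluster described above:

page = "http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp";
links = Import[page, "Hyperlinks"]; (* every outbound link on the page *)
edges = Map[(page -> #) &, links];
GraphPlot[edges, VertexLabeling -> Tooltip]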

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian primarily interested in the written word, my first step was to isolate the plaintext throughout the material so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels: first, the aggregate collection, and second, the collection divided into each 'file,' or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words appear within the nearly eight million words of the collection? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader, and it is of minimal visual attractiveness. Two quick decisions were made to enhance data usability. First, the information was "case-folded," all sent to lower case; this enables us to count the total appearances of a word such as "digital" regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency (as opposed to phrase frequency), "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is composed entirely of stop words), but they will quickly dominate any word-level visualization or list.53 Other words may need to be added to or removed from stop lists depending on the corpus; for example, if the phrase "page number" appeared on every page, that would be useless information for the researcher.
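Stripping a stop list in Mathematica is a one-line affair. The abbreviated list below is an illustrative stand-in for the fuller published lists discussed above, applied to the words variable from the earlier sketch:

stopwords = {"the", "and", "to", "of", "a", "in", "is", "that", "for", "it"};
(* keep only the words that do not appear in the stop list *)
filtered = Select[words, ! MemberQ[stopwords, #] &];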

The first step, then, is to use Mathematica to format the word frequency information into a Wordle.net-readable format (putting all of the text directly into that website would crash it, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds.

A Word Cloud of All Words in the CDC Corpus

What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that the collection deals with Canada and Canadians, that it holds documents in French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," and that "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, a word cloud leaves considerable ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but they need to be used carefully. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present": despite the ambiguous title, we can quickly learn from the visualization what this website contains. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.

Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters (such as 'bi'), two syllables, or, pertinent in our case, two words; an example is "canada is." A trigram is three words, a quadgram four words, a fivegram five words, and so on.
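Mechanically, extracting word-level n-grams amounts to sliding a window of length n across the word list. A minimal Mathematica sketch, again assuming the words variable from earlier:

ngrams[wordlist_List, n_Integer] :=
  Reverse[SortBy[Tally[Partition[wordlist, n, 1]], Last]]
(* the ten most frequent trigrams *)
Take[ngrams[words, 3], 10]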

Word Cloud of the File "Past and Present"

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguously named site could be about almost anything. Initially, I decided to study word frequency:

"alberta" 758; "listen" 490; "settlers" 467; "new" 454; "canada" 388; "land" 362; "read" 328; "people" 295; "settlement" 276; "chinese" 268; "canadian" 264; "place" 255; "heritage" 254; "edmonton" 250; "years" 238; "west" 236; "raymond" 218; "ukrainian" 216; "came" 213; "calgary" 209; "names" 206; "ranch" 204; "updated" 193; "history" 190; "communities" 183 ...

Rather than having to visit the website, we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, as in this automatically generated output of top phrases:

"first people settlers" 100; "the united states" 72; "one of the" 59; "of the west" 58; "settlers last updated" 56; "people settlers last" 56; "albertans first people" 56; "adventurous albertans first" 56; "the bar u" 55; "opening of the" 54; "communities adventurous albertans" 54; "new communities adventurous" 54; "bar u ranch" 53; "listen to the" 52 ...

Through a few clicks, we can navigate and determine, from a distance, the basic contours of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to home in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or roughly 0.117). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stop-word" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below.

"experience in the" 919; "transferred to the" 918; "funded by the" 918; "of the provincial" 915; "this web site" 915; "409 registry 2" 914; "are no longer" 914; "gouvernement du canada" 910; "use this link" 887; "page or you" 887; "following page or" 887; "automatically transferred to" 887; "be automatically transferred" 887; "will be automatically" 887; "industry you will" 887; "multimedia industry you" 887; "practical experience in" 887; "gained practical experience" 887; "they gained practical" 887; "while they gained" 887; "canadians while they" 887; "young canadians while" 887; "showcase of work" 887; "excellent showcase of" 887 ...

Here we see an array of common language: funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, restructuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we move into the unique or very rare phrases that appear only once, twice, or three times.
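Stepping back, the two counting schemes introduced above are easy to express if the corpus is held as a list of per-site word lists (an assumed structure, for illustration):

aggregateCount[corpus_, term_] := Total[Count[#, term] & /@ corpus]
(* the fraction of sites in which the term appears at least once *)
relativeCount[corpus_, term_] :=
  N[Count[corpus, site_ /; MemberQ[site, term]]/Length[corpus]]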

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a computer science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time ... Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or MALLET itself via a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable, and they come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.
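For those working outside the Meandre workbench, the equivalent command-line MALLET run looks roughly like the following. The directory names and the choice of 20 topics are illustrative assumptions, not the exact settings used here:

bin/mallet import-dir --input cdc-plaintext/ --output cdc.mallet \
  --keep-sequence --remove-stopwords
bin/mallet train-topics --input cdc.mallet --num-topics 20 \
  --output-topic-keys cdc-topic-keys.txt --xml-topic-report cdc-topics.xml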

At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text down to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not simply use Google to do all of this in our own work. First, we do not know precisely how Google works: it is a proprietary black box. While the original algorithm is available, the actual methods that go into determining the ranking of search terms are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.

A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. In the graph above, for example, we see a visualization of our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at ianmilligan.ca/2013/01/18/cdc-tool-1.
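A sketch of the underlying routine, assuming each site's plaintext has been saved to its own .txt file (the directory name and search term are illustrative):

files = FileNames["*.txt", "cdc-plaintext"];
termCount[file_, term_] := StringCount[ToLowerCase[Import[file, "Text"]], term];
counts = termCount[#, "freedom"] & /@ files;
(* mousing over a bar reveals which file it represents *)
BarChart[MapThread[Tooltip, {counts, files}]]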

We can then move down into the data. In the example above, file "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.

A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are located. If we move forward, we can see that it pertains to the nineteenth century, to a particular region, and to subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to quickly identify critical sources.
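A keyword-in-context routine of this sort can be sketched in a few lines; the 40-character window on either side of each match is an arbitrary illustrative choice:

kwic[text_String, term_String] :=
  StringTake[text,
    {Max[#[[1]] - 40, 1], Min[#[[2]] + 40, StringLength[text]]} & /@
      StringPosition[text, term]]
(* every occurrence of "black" with its surrounding context *)
Column[kwic[lowerpage, "black"]]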

Another model through which we can visualize documents is a viewer primarily focusing on term frequency and inverse document frequency, discussed below. To prepare this material, we combine our earlier data about word frequency (that the word "experience" appears 919 times in a site, perhaps) with aggregate information about frequency in the entire corpus (i.e., that the word "experience" appears 1,766 times overall). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, through which we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf times the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."
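In symbols, the weight is tfidf = tf × log(N/df), where N is the total number of documents and df the number of documents containing the term. A one-line Mathematica version, in the spirit of the Turkel demonstration cited in the endnotes (the numbers below are illustrative, echoing the examples above):

tfidf[tf_, ndocs_, df_] := tf*Log[ndocs/df]
(* e.g., a word appearing 919 times in a site, present in 45 of 384 sites *)
tfidf[919., 384, 45]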

At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suites of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program rather than quickly navigating between them. This stems from the massive number of files and the amount of data involved: Many Eyes limits files to 5 MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted to queues in advance rather than processed on demand, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need arises, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential of working with born-digital sources. It is intended to jumpstart a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, or at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. They should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and of attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software and databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information; this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of the various ways to openly disseminate material on the web, whether through blogging, micro-blogging, or publishing webpages: it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history, and the readiness to work with born-digital sources, cannot stem from pedagogy alone, however; institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples from the CDC case study above, for instance, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article has argued, first, why digital methodologies are necessary, and second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996: Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293-301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22-4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3-4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and the structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October-December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176-82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.

14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients - 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to teach the basics of computational history. See The Programming Historian 2, 2012-present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305-33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp, Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point; for the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook: Penguin, 2010. For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2, The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online, www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes, a megabyte (MB) a million, a gigabyte (GB) a billion, a terabyte (TB) a trillion, and a petabyte (PB) a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can of course vary depending on the compression methods chosen. Their math was based on an average of 8 MB per book, if scanned at 600 dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10000000000000000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176-82, published online ahead of print 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC: May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6-19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Centre for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Centre for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in ProQuest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online: wayback.archive.org/web/*/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online: web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collection Webpage, 24 November 2000, through Internet Archive: web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections homepage," 21 November 2003, through Internet Archive: web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important news about Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive: web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 pm on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and now "Emma" is the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.


48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 "Important Notices," Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. They can usually be found on index or splash pages, under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie file, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated late-2007 MacBook Pro.

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?," Thoughts on Digital & Public History, 11 August 2011, adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.


rather than having sources mediated through the lens of outside disciplines or commercial interests. Algorithms may be one meaningful way to make sense of cultural trends or extract meaningful information from billions of tweets, but we must ensure professional historical input goes into their crafting.

The Third Wave of Computational History

Historians have fruitfully used computers in the past, as part of two previous waves. The 1960s and 1970s saw computational tools used for demographic, population, and economic histories, a trend that in Canada saw full realization in comprehensive studies such as Michael Katz's The People of Hamilton, Canada West.17 As Ian Anderson has convincingly noted in a review of history and computing, studies of this sort saw the field become associated with quantitative studies.18

After a retreat from the spotlight, computational history rose again in the 1990s with the advent of the personal computer, graphical user interfaces, and overall improvements in ease of use. Punch cards could now give way to database programs, GIS, and early online networks such as H-Net and Usenet. We are now, I believe, on the cusp of a third revolution in computational history, thanks to three main factors: decreasing storage costs, the power of the Internet and distributed cloud computing, and the rise of professionals dealing with both digital preservation and open-source tools.

Decreasing storage costs have led to massive amounts of information being preserved, as Gleick noted in this article's epigraph. It is important to note that this information overload is not new. People have long worried about the impact of too much information.19 In the sixteenth century, responding to the rise of the printing press, the German priest Martin Luther decried that the "multitude of books [were] a great evil." In the nineteenth century, Edgar Allan Poe bemoaned that "[t]he enormous multiplication of books in every branch of knowledge is one of the greatest evils of this age." As recently as 1970, American historian Lewis Mumford lamented that "the overproduction of books will bring about a state of intellectual enervation and depletion hardly to be distinguished from massive ignorance."20 The rise of born-digital sources must thus be seen in this continuous context of handwringing around the expansion and rise of information.

Useful historical information is being preserved at a rate that is accelerating with every passing day. One useful metric for comparing dataset size is that of the American Library of Congress (LOC); indeed, a "LOC" has become shorthand for a unit of measurement in the digital humanities. Measuring the actual extent of the printed collection in terms of data is notoriously difficult. James Gleick has claimed it to be around 10 TB,22 whereas the Library of Congress itself claims a more accurate figure is about 200 TB.23 While these two figures are obviously divergent, neither is on an unimaginable scale: personal computers now often ship with a terabyte or more of storage. When Claude Shannon, the father of information theory, carried out a thought experiment in 1949 about items or places that might store "information" (an emerging field which viewed messages in their aggregate), he scaled sites from a single digit upwards to the largest collection he could imagine. The list scaled up logarithmically from 10^0 to 10^13: if a single-spaced page of typing is 10^4, 10^7 the Proceedings of the Institute of Radio Engineers, and 10^9 the entire text of the Encyclopaedia Britannica, then 10^14 was "the largest information stockpile he could think of: the Library of Congress."24

But now the Library of Congress' print collection, something that has taken two centuries to gather, is dwarfed by its newest collection of archived born-digital sources, the vast majority only a matter of years old compared to the much wider date-range of non-digital sources in the traditional collection. The LOC has begun collecting its own born-digital Internet archive: multiple snapshots are taken of webpages to show how they change over time. Even as a selective, curated collection drawing on governmental, political, educational, and creative websites, the LOC has already collected 254 TB of data and adds 5 TB a month.25 The Internet Archive, through its Wayback Machine, is even more ambitious: it seeks to archive every website on the Internet. While its size is also hard to put an exact finger on, as of late 2012 it had over 10 petabytes of information (or, to put it into perspective, a little over 50 Library of Congresses, if we take the 200 TB figure).26


This is, to some degree, comparing apples and oranges: the Library of Congress is predominantly print, whereas the Internet Archive has considerable multimedia holdings, which include videos, images, and audio files. The LOC is, furthermore, a more curated collection, whereas the Internet Archive draws from a wider range of producers. The Internet offers the advantages and disadvantages of being a more democratic archive. For example, a GeoCities site (an early way for anybody to create their own website) made by a 12-year-old Canadian in the mid-1990s might be preserved by the Internet Archive, whereas it almost certainly would not be saved by a national archive. This difference in size and collections management, however, is at the root of the changing historians' toolkit. If we make use of these large data troves, we can access a new and broad range of historical subjects.

Conventional sources are also part of this deluge. Google Books has been digitizing the printed word, and it currently has 15 million works completed; by 2020, it audaciously aims to digitize every book ever written. While copyright issues loom, the project will allow researchers to access aggregate word and phrase data, as seen in the Google n-gram project.27 In Canada, Library and Archives Canada (LAC) has a more circumscribed digitization project, collecting "a representative sample of Canadian websites," focusing particularly on Government of Canada webpages, as well as a curated approach preserving case studies such as the 2006 federal election. Currently, it has 4 TB of federal government information and 7 TB of total information in its larger web archive (a figure which, as I note below, is unlikely to drastically improve despite ostensible goals of archival 'modernization').28

In Canada, LAC has, on the surface, been advancing an ambitious digitization program, as it realizes the challenges now facing archivists and historians. Daniel J. Caron, the Librarian and Archivist of Canada, has been outspoken on this front, with several public addresses on the question of digital sources and archives. Throughout 2012, primary source digitization at LAC preoccupied critics who saw it as a means to depreciate on-site access. The Canadian Association of University Teachers and the Canadian Historical Association, as well as other professional organizations, mounted campaigns against this shift in resources.29 LAC does have a point, however, with its modernization agenda. The state of Canada's online collections is small and sorely lacking when compared to the on-site collections, and LAC does need to modernize and achieve the laudable goal of reaching audiences beyond Ottawa.

Nonetheless, digitization has unfortunately been used as a smokescreen to depreciate overall service offerings.30 The modernization agenda would also see 50 percent of the digitization staff cut.31 LAC's online collections are small; it does not work with online developers through Application Programming Interfaces (APIs, a way for a computer program to talk to another computer and speed up research; Canadiana.ca is, incidentally, a leader in this area); and there have been no promising indications that this state of affairs will change. Digitization has not been comprehensive. Online preservation is important, and historians must fight for both on-site and on-line access.

A number of historians have recognized this challenge and are playing an instrumental role in preserving the born-digital past. The Roy Rosenzweig Center for History and New Media at George Mason University launched several high-profile preservation projects: the September 11 Digital Archive, the Hurricane Digital Memory Bank (focusing on Hurricanes Katrina and Rita), and, as of writing, the Occupy Archive.32 Massive amounts of online content are curated and preserved: photographs, news reports, blog posts, and now tweets. These complement more traditional efforts of collecting and preserving oral histories and personal recollections, which are then geo-tagged (tied to a particular geographical location), transcribed, and placed online. These archives serve a dual purpose: preserving the past for future generations, while also facilitating easy dissemination for today's audiences.

Digital preservation, however, is a topic of critical importance. Much of our early digital heritage has been lost. Many websites, especially those predating the Internet Archive's 1996 web archiving project launch, are completely gone. Early storage media (from tape to early floppy disks) have become obsolete and are now nearly inaccessible. Compounding this, software development has seen early proprietary file formats depreciated, ignored, and eventually made wholly inaccessible.33 Fortunately, the problem of digital preservation was recognized by the 1990s, and in 2000 the American government established the National Digital Information Infrastructure and Preservation Program.34 As early as 1998, digital preservationists argued: "[h]istorians will look back on this era … and see a period of very little information. A 'digital gap' will span from the beginning of the wide-spread use of the computer until the time we eventually solve this problem. What we're all trying to do is to shorten that gap." To bear home their successes, that quotation is taken from an Internet Archive snapshot of Wired magazine's website.35

Indeed, greater attention towards rigorous metadata (which is, briefly put, data about data, often preserved in plain text format), open-access file formats documented by the International Organization for Standardization (ISO), and dedicated digital preservation librarians and archivists gives us hope for the future.36 It is my belief that cloud storage, the online hosting of data in large third-party offsite centres, mitigates the issue of obsolete storage media, and, furthermore, that the move towards open-source file formats, even amongst large commercial enterprises such as Microsoft, provides greater documentation and uniformity.


A Brief Case Study: Accessing and Navigating Canada's Digital Collections, with an Emphasis on Tools for Born-Digital Sources

What can we do about this digital deluge? There are no simple answers, but historians must begin to conceptualize new additions to their traditional research and pedagogical toolkits. In the section that follows, I will do two things: introduce a case study, and introduce several tools, both proprietary and openly available off the shelf, to reveal one way forward.

For a case study, I have elected to use Canada's Digital Collections (CDC), a collection of 'dead' websites removed from their original location on the World Wide Web and now digitally archived at LAC. In this section, I will introduce the collection using digitally-archived sources from the early World Wide Web and discuss my research methodology, before illustrating how we can derive meaningful data from massive arrays of unstructured data.37

This is a case study of a web archive. Several of these methodologies could be applied to other formats, including large arrays of textual material, whether born-digital or not. A distant reading of this type is not medium specific. It does, however, assume that the text is high quality. Digitized primary documents do not always have text layers, and if these are instituted computationally through Optical Character Recognition, errors can appear. Accuracy rates for digitized textual documents can range from as low as 40 percent for seventeenth- and eighteenth-century documents, to 80–90 percent for modern microfilmed newspapers, to over 99 percent for typeset, word-processed documents.38 That said, while historians today may be more interested in applying distant reading methodologies to more traditional primary source materials, it is my belief that in the not-so-distant future, websites will be an increasingly significant source for social, cultural, and political historians.

What are Canada's Digital Collections? In 1996, Industry Canada, funded by the federal Youth Employment Strategy, facilitated the development of Canadian heritage websites by youth (defined as anyone between the ages of 15 and 30). It issued contracts to private corporations that agreed to use these individuals as web developers, for the twofold purpose of developing digital skills and instilling respect for, and engagement with, Canadian heritage and preservation.39 The collection, initially called SchoolNet but rebranded as Canada's Digital Collections in 1999, was accessible from the CDC homepage. After 2004, the program was shut down, and the vast majority of the websites were archived on LAC's webpage. This collection is now the largest born-digital Internet archive that LAC holds, worthy of study as content as well as for thinking about methodological approaches.

Conventional records about the program are unavailable in the traditional LAC collection, unsurprising in light of both its recent creation and backlogs in processing and accessioning archival materials. We can use one element of our born-digital toolkit, the Internet Archive's WaybackMachine (its archive of preserved early webpages), to get a sense of what this project looked like in its heyday. The main webpage for the CDC was www.collections.ic.gc.ca. However, a visitor today will see nothing but a collection of broken image links (two of them will bring you to the English and French-language versions of the LAC-hosted web archive, although this is only discernible through trial and error, or through the site's source code).

With the WaybackMachine, we have a powerful, albeit limited, opportunity to "go back in time" and experience the website as previous web viewers would have (if a preserved website is available, of course). Among the over 420 million archived websites, we can search for and find the original CDC site. Between 12 October 1999 and the most recent snapshot, or web crawl, of the now-broken site on 6 July 2011, the site was crawled 410 times.40 Unfortunately, every site version until the 15 October 2000 crawl has been somehow corrupted. Content has instead been replaced by Chinese characters, which translate only into place names and other random words. Issues of early digital preservation, noted above, have reared their head in this collection.

From late 2000 onward, however, fully preserved CDC sites are accessible. Featured sites, new entries, success stories, awards, curricular units, and prominent invitations to apply for project funding populate the brightly coloured website. The excitement is palpable. John Manley, then Industry Minister, declared: "it is crucial that Canada's young people get the skills and experience they need to participate in the knowledge-based economy of the future," and that CDC "shows that young people facing significant employment challenges can contribute to and learn from the resources available on the Information Highway."41 The documentation, reflecting the lower level of technical expertise and scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as well as preservation issues, hardware requirements for computers (a computer with an Intel 386 processor, 8 MB RAM, a 1 GB hard drive, and a 28.8 kbps modem was recommended), and training information for employers, educators, and participants.42 Sample proposals and community liaison ideas, amidst other information, set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites offers an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how viewers discovered the site and whether they think it is a good educational resource (the administrators must have been relieved that the majority thought so: in 2003, some 84 percent felt that it was a "good" resource, the best of its type).43 By 2004, the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, alphabetical listing, success stories, and detailed copyright information.

The Canada's Digital Collections site began to move towards removal in November 2006, and initially it was completely gone. A visitor to the site would learn that the content "is no longer available," and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007 a notice appeared with a declaration that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained, then and today for that matter, is a straightforward alphabetical list, and a disclaimer that noted: "[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost."

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use to navigate large amounts of data. First, I create my own tools through computer programming; this work is showcased in the following section. I program in Wolfram Research's Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base, which complements exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website in perfectly readable plaintext (avoiding unicode errors), as follows: input = Import["http://activehistory.ca", "Plaintext"], and a second can make it all lower case for textual analysis, as such: lowerpage = ToLowerCase[input]. Variables are a critical building block: in the former example, "input" now holds the entire plaintext of the website ActiveHistory.ca, and "lowerpage" then holds it in lowercase. Once we have lower-cased information, we can quickly extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, generate word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
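To make this concrete, here is a minimal sketch of the word-frequency step described above; the URL and the choice of twenty-five results are illustrative only, and the functions used (Import, StringSplit, Tally, SortBy) are standard Wolfram Language built-ins:

    (* import a page as plaintext and lower-case it *)
    input = Import["http://activehistory.ca", "Plaintext"];
    lowerpage = ToLowerCase[input];
    (* split on runs of non-word characters, discarding empty strings *)
    words = DeleteCases[StringSplit[lowerpage, Except[WordCharacter] ..], ""];
    (* tally each distinct word and sort by descending frequency *)
    wordFreq = Reverse[SortBy[Tally[words], Last]];
    Take[wordFreq, 25]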

Even more promising is Mathematicarsquos integration with theWolfram|Alpha database (itself accessible at wolframalphacom) acomputational knowledge engine that promises to put incredibleamounts of information at a userrsquos fingertips With a quick com-mand one can access the temperature at Torontorsquos Lester B PearsonInternational Airport at 500 pm on 18 December 1983 the rela-

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

tive popularity of names such as ldquoEdithrdquo or ldquoEmmardquo the Canadianunemployment rate in March 1985 or the Canadian GrossDomestic Product in December 199546 This allows one to quicklycross-reference findings with statistical information or find coinci-dental points of comparison As universities are now being activelyencouraged to contribute their information to Wolfram|Alpha thiswill become an increasingly powerful database
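As one hedged illustration: Mathematica ships with a built-in WolframAlpha[] function that sends free-form queries to the knowledge engine; the exact query phrasings below are simply guesses at what a user might type.

    (* query Wolfram|Alpha directly from within a notebook *)
    WolframAlpha["Canada unemployment rate March 1985"]
    WolframAlpha["temperature Toronto Pearson airport 5pm 18 December 1983"]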

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be-all and end-all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools.47 It is a powerful suite of textual analysis tools. Uploading one's own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees a "word cloud" of a corpus, can click on any word to see its relative rise and fall over time (word trends), access the original text, and click on any word to see its contextual placement within individual sentences. It ably facilitates the "distant" reading of any reasonably-sized body of textual information, and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis tool rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. There are many visualization options provided: you can plot relationships between data, compare values (i.e., bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as to IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or, in a best-case scenario, hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow, mentioned in this article, is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" that appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era, and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2, or on my own blog at ianmilligan.ca. I have elected to leave those discussions online rather than provide commands in this article: they require some technical knowledge, and working directly from the web is quicker. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian, dealing with conventional sources, often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command line entry brought the entire archive onto my system; it took days to process the information and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay, or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations. The information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, with 360 MB of plaintext with HTML encoding, reduced to a 'mere' 50 MB if all encoding is removed. A 50 MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with material. Before you do this, though, you need to be familiar with the website's "terms of service," the legal documentation that outlines how users are allowed to use a website, resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced in part or in whole and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so: the index is located at epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in epe.lac-bac.gc.ca/100/205/301/ic/cdc/ABbenevolat, the next in .../cdc/abnature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system, in this case one noting that you had seen the disclaimer about the archival nature of the website), and then, with a one-line command, to download the entire database.52 Due to the imposition of a random wait and a bandwidth limit, which meant that I would download one file every half to one and a half seconds, at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these connections must often be inferred online. HTML born-digital sources are different, in that the link offers a very concrete file.


Visualizing the CDC Collection as a Series of Hyperlinks

One quick visualization, the circular cluster seen here, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As the figure shows, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn which link it is, and thus see where the collection fits into the broader Internet. The visualization is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and the relationships between different sources. At a glimpse, we can see suggestions to follow other sites, and determine whether any nodes are shared amongst the various websites revealed.
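A rough sketch of how such a link graph can be started in Mathematica follows; the "Hyperlinks" import element and the Graph function are built in, while the seed URL and the single-step depth are illustrative simplifications of the two-link crawl described above.

    (* scrape the outbound links of a seed page *)
    seed = "http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp";
    links = Import[seed, "Hyperlinks"];
    (* one-step directed edges from the seed to each target *)
    edges = Map[(seed -> #) &, links];
    (* draw the graph; hovering over a vertex reveals its URL *)
    Graph[edges, VertexLabels -> Placed["Name", Tooltip]]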

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian who is primarily interested in the written word, my first step was to isolate the plaintext throughout the material, so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels: first, the aggregate collection, and second, dividing the collection into each 'file,' or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words appear within the nearly eight million words of the collection? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader, and it is of minimal visual attractiveness. Two quick decisions were made to enhance data usability. First, the information was "case-folded," all sent to lower-case; this enables us to count the total appearances of a word such as "digital" regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency (as opposed to phrase frequency), "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is comprised entirely of stop words), but they will quickly dominate any word-level visualization or list.53 Other words may need to be added to or removed from stop lists depending on the corpus: for example, if the phrase "page number" appeared on every page, that would be useless information for the researcher.
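Continuing the earlier sketch, stop-word removal is a short filter; the ten-word stop list below is deliberately tiny and illustrative, where a production list would run to hundreds of entries:

    (* a deliberately short, illustrative stop list *)
    stopwords = {"the", "and", "to", "of", "a", "in", "is", "for", "that", "on"};
    (* drop any word matching the stop list, then re-tally *)
    filtered = DeleteCases[words, Alternatives @@ stopwords];
    wordFreqFiltered = Reverse[SortBy[Tally[filtered], Last]];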

The first step, then, is to use Mathematica to format word frequency information into a Wordle.net-readable format (if we put all of the text into that website it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds.


A Word Cloud of All Words in the CDC Corpus

What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that the collection deals with Canada and Canadians, that it holds documents in French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, a word cloud leaves a severe amount of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or to educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps, and have their place, but they need to be used with extreme caution. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present." Despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically use any higher number. A bigram can be a combination of two characters (such as 'bi'), two syllables, or, pertinent in our case, two words; an example is "canada is." A trigram is three words, a quadgram four words, a fivegram five words, and so on.
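Word-level n-grams of this sort are easy to generate by hand: Partition, with an offset of one, returns every overlapping window of n words. A sketch, reusing the word list from the earlier example:

    (* overlapping word-level n-grams: offset 1 slides the window one word at a time *)
    ngrams[wordlist_List, n_Integer] := Partition[wordlist, n, 1];
    (* tally the trigrams and show the ten most frequent *)
    trigramFreq = Reverse[SortBy[Tally[ngrams[words, 3]], Last]];
    Take[trigramFreq, 10]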


Word Cloud of the File "Past and Present"

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

"alberta" 758, "listen" 490, "settlers" 467, "new" 454, "canada" 388, "land" 362, "read" 328, "people" 295, "settlement" 276, "chinese" 268, "canadian" 264, "place" 255, "heritage" 254, "edmonton" 250, "years" 238, "west" 236, "raymond" 218, "ukrainian" 216, "came" 213, "calgary" 209, "names" 206, "ranch" 204, "updated" 193, "history" 190, "communities" 183 …

Rather than having to visit the website, we can see almost immediately, at a glance, what the site is about. More topics become apparent when one looks at trigrams, as in this automatically generated output of top phrases:

"first" "people" "settlers" 100, "the" "united" "states" 72, "one" "of" "the" 59, "of" "the" "west" 58, "settlers" "last" "updated" 56, "people" "settlers" "last" 56, "albertans" "first" "people" 56, "adventurous" "albertans" "first" 56, "the" "bar" "u" 55, "opening" "of" "the" 54, "communities" "adventurous" "albertans" 54, "new" "communities" "adventurous" 54, "bar" "u" "ranch" 53, "listen" "to" "the" 52 …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more information or less, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stop-word" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

"experience" "in" "the" 919, "transferred" "to" "the" 918, "funded" "by" "the" 918, "of" "the" "provincial" 915, "this" "web" "site" 915, "409" "registry" "2" 914, "are" "no" "longer" 914, "gouvernement" "du" "canada" 910, "use" "this" "link" 887, "page" "or" "you" 887, "following" "page" "or" 887, "automatically" "transferred" "to" 887, "be" "automatically" "transferred" 887, "will" "be" "automatically" 887, "industry" "you" "will" 887, "multimedia" "industry" "you" 887, "practical" "experience" "in" 887, "gained" "practical" "experience" 887, "they" "gained" "practical" 887, "while" "they" "gained" 887, "canadians" "while" "they" 887, "young" "canadians" "while" 887, "showcase" "of" "work" 887, "excellent" "showcase" "of" 887 …

Here we see an array of common language: funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see which language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we move into the unique or very rare phrases that appear only once, twice, or three times.

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable, and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works. It is a proprietary black box: while the original algorithm is available, the actual methods that go into determining the ranking of search terms are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above, we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at ianmilligan.ca/2013/01/18/cdc-tool-1.
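The underlying search-and-chart step might look as follows; here siteFreqs is a hypothetical list of {site name, word-frequency list} pairs, one per CDC site, assembled with the frequency code shown earlier:

    (* total occurrences of a term in one site's {word, count} frequency list *)
    countFor[freqList_, term_String] := Total[Cases[freqList, {term, n_} :> n]];
    (* count "freedom" across every site *)
    hits = Map[{First[#], countFor[Last[#], "freedom"]} &, siteFreqs];
    (* one bar per site; mousing over a bar reveals the site's name *)
    BarChart[Map[Tooltip[Last[#], First[#]] &, hits]]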

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.


A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly realize critical sources.
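The keyword-in-context routine itself can be approximated in a few lines; this sketch grabs up to five words of context on either side, and assumes the search term contains no regular-expression metacharacters:

    (* keyword-in-context: capture up to five words on either side of the target *)
    kwic[text_String, term_String] := StringCases[ToLowerCase[text],
      RegularExpression["(?:\\S+\\s+){0,5}" <> term <> "(?:\\s+\\S+){0,5}"]];
    kwic[input, "ontario"] // Column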

Another model through which we can visualize documents is a viewer primarily focusing on term document frequency and inverse frequency terms, discussed below. To prepare this material, we combine our earlier data about word frequency (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e., the word "experience" appears 1,766 times). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn which words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf times the logarithm of the number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."
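As a worked sketch of that formula: the 919 figure comes from the text above, while the assumption that "experience" appears in 300 of the collection's 384 sites is invented purely for illustration (Log here is the natural logarithm; the choice of base is a convention):

    (* tf-idf: term frequency times log(total documents / documents containing the term) *)
    tfidf[tf_, totalDocs_, docsWithTerm_] := N[tf*Log[totalDocs/docsWithTerm]];
    (* "experience": 919 occurrences in one site, present in a hypothetical 300 of 384 sites *)
    tfidf[919, 384, 300]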


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amounts of files and data involved: Many Eyes limits files to 5 MB, and Voyant has no hard and fast rule, but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance, rather than processed on demand, and, perhaps most importantly, HPC centres prefer to run highly optimized code, to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need rears itself, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin, or jumpstart, a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. These should not be mandatory, but instead taught alongside, and within, traditional historiography and research methods courses. In an era of diminishing resources and attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information; this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web, whether through blogging, micro-blogging, publishing webpages, and so forth; it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history, and the readiness to work with born-digital sources, cannot stem out of pedagogy alone, however. Institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article has argued, first, why digital methodologies are necessary and, second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1. The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2. A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3. See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4. These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5. Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6. This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7. Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8. While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9. A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10. Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11. Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12. Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13. For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.genealogy.umontreal.ca/fr/leprdh.htm.


14. "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15. For an overview, see "Award Recipients - 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16. One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to teach the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17. Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18. Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19. Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20. The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook: Penguin, 2010. For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21. Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online at www.mashable.com/2010/03/09/cisco-crs-3.

22. A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23. These figures can of course vary depending on compression methods chosen. Their math was based on an average of 8MB per book, if scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24. Gleick, 232.

25. "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26. "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27. Jean-Baptiste Michel et al., 176–82, published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28. "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29. For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30. Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31. "Save LAC May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32. The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Centre for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Centre for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33. Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34. For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35. Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36. Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37. Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so that one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38. See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39. Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40. "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online, www.wayback.archive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41. "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online, www.web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42. Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collection Webpage, 24 November 2000, through Internet Archive, www.web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43. "Canada's Digital Collections homepage," 21 November 2003, through Internet Archive, www.web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44. "Important news about Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive, www.web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45. For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46. It was 8 degrees Celsius at 5:00 pm on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and "Emma" was the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47. It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.


48. For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49. The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50. For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51. "Important Notices," Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. These can usually be found on index or splash pages, under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52. The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53. For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8, available online at www.nlp.stanford.edu/IR-book.

54. David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55. The research was actually carried out on a rather dated, late-2007 MacBook Pro.

56. See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57. "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58. "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59. Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.



… this continuous context of handwringing around the expansion and rise of information.

Useful historical information is being preserved at a rate that is accelerating with every passing day. One useful metric to compare dataset size is that of the American Library of Congress (LOC); indeed, a "LOC" has become shorthand for a unit of measurement in the digital humanities. Measuring the actual extent of the printed collection in terms of data is notoriously difficult: James Gleick has claimed it to be around 10TB,22 whereas the Library of Congress itself claims a more accurate figure is about 200TB.23 While these two figures are obviously divergent, neither is on an unimaginable scale; personal computers now often ship with a terabyte or more of storage. When Claude Shannon, the father of information theory, carried out a thought experiment in 1949 about items or places that might store "information" (an emerging field which viewed messages in their aggregate), he scaled sites from a digit upwards to the largest collection he could imagine. The list scaled up logarithmically, from 10^0 to 10^13. If a single-spaced page of typing is 10^4, 10^7 the Proceedings of the Institute of Radio Engineers, and 10^9 the entire text of the Encyclopaedia Britannica, then 10^14 was "the largest information stockpile he could think of: the Library of Congress."24

But now the Library of Congress' print collection, something that has taken two centuries to gather, is dwarfed by its newest collection of archived born-digital sources, the vast majority only a matter of years old compared to the much wider date-range of non-digital sources in the traditional collection. The LOC has begun collecting its own born-digital Internet archive: multiple snapshots are taken of webpages to show how they change over time. Even as a selective, curated collection drawing on governmental, political, educational, and creative websites, the LOC has already collected 254TB of data and adds 5TB a month.25 The Internet Archive, through its Wayback Machine, is even more ambitious: it seeks to archive every website on the Internet. While its size is also hard to put an exact finger on, as of late 2012 it had over 10 petabytes of information (or, to put it into perspective, a little over 50 Library of Congresses if we take the 200TB figure).26


This is to some degree comparing apples and oranges: the Library of Congress is predominantly print, whereas the Internet Archive has considerable multimedia holdings, which include videos, images, and audio files. The LOC is furthermore a more curated collection, whereas the Internet Archive draws from a wider range of producers. The Internet offers the advantages and disadvantages of being a more democratic archive. For example, a Geocities site (an early way for anybody to create their own website) created by a 12-year-old Canadian in the mid-1990s might be preserved by the Internet Archive, whereas it almost certainly would not be saved by a national archive. This difference in size and collections management, however, is at the root of the changing historians' toolkit. If we make use of these large data troves, we can access a new and broad range of historical subjects.

Conventional sources are also part of this deluge. Google Books has been digitizing the printed word: it currently has 15 million works completed, and by 2020 it audaciously aims to digitize every book ever written. While copyright issues loom, the project will allow researchers to access aggregate word and phrase data, as seen in the Google n-gram project.27 In Canada, Library and Archives Canada (LAC) has a more circumscribed digitization project, collecting "a representative sample of Canadian websites," focusing particularly on Government of Canada webpages, as well as a curated approach preserving case studies such as the 2006 federal election. Currently it has 4TB of federal government information and 7TB of total information in its larger web archive (a figure which, as I note below, is unlikely to drastically improve despite ostensible goals of archival 'modernization').28

In Canada, LAC has, on the surface, been advancing an ambitious digitization program, as it realizes the challenges now facing archivists and historians. Daniel J. Caron, the Librarian and Archivist of Canada, has been outspoken on this front, with several public addresses on the question of digital sources and archives. Throughout 2012, primary source digitization at LAC preoccupied critics who saw it as a means to depreciate on-site access. The Canadian Association of University Teachers and the Canadian Historical Association, as well as other professional organizations, mounted campaigns against this shift in resources.29 LAC does have a point, however, with its modernization agenda: Canada's online collections are small and sorely lacking when compared to its on-site collections, and LAC does need to modernize and achieve the laudable goal of reaching audiences beyond Ottawa.

Nonetheless, digitization has unfortunately been used as a smokescreen to depreciate overall service offerings.30 The modernization agenda would also see 50 percent of the digitization staff cut.31 LAC's online collections are small; it does not work with online developers through Application Programming Interfaces (APIs, a way for one computer program to talk to another and speed up research; Canadiana.ca is, incidentally, a leader in this area); and there have been no promising indications that this state of affairs will change. Digitization has not been comprehensive. Online preservation is important, and historians must fight for both on-site and online access.

A number of historians have recognized this challenge and are playing an instrumental role in preserving the born-digital past. The Roy Rosenzweig Center for History and New Media at George Mason University launched several high-profile preservation projects: the September 11 Digital Archive, the Hurricane Digital Memory Bank (focusing on Hurricanes Katrina and Rita), and, as of writing, the Occupy archive.32 Massive amounts of online content are curated and preserved: photographs, news reports, blog posts, and now tweets. These complement more traditional efforts of collecting and preserving oral histories and personal recollections, which are then geo-tagged (tied to a particular geographical location), transcribed, and placed online. These archives serve a dual purpose: preserving the past for future generations while also facilitating easy dissemination for today's audiences.

Digital preservation, however, is a topic of critical importance. Much of our early digital heritage has been lost. Many websites, especially those predating the Internet Archive's 1996 web archiving project launch, are completely gone. Early storage media (from tape to early floppy disks) have become obsolete and are now nearly inaccessible. Compounding this, software development has seen early proprietary file formats depreciated, ignored, and eventually made wholly inaccessible.33 Fortunately, the problem of digital preservation was recognized by the 1990s, and in 2000 the American government established the National Digital Information Infrastructure and Preservation Program.34 As early as 1998, digital preservationists argued: "[h]istorians will look back on this era … and see a period of very little information. A 'digital gap' will span from the beginning of the wide-spread use of the computer until the time we eventually solve this problem. What we're all trying to do is to shorten that gap." To bear home their successes, that quotation is taken from an Internet Archive snapshot of Wired magazine's website.35

Indeed, greater attention towards rigorous metadata (which is, briefly put, data about data, often preserved in plain text format), open-access file formats documented by the International Organization for Standardization (ISO), and dedicated digital preservation librarians and archivists gives us hope for the future.36 It is my belief that cloud storage, the online hosting of data in large third-party offsite centres, mitigates the issue of obsolete storage media, and furthermore that the move towards open source file formats, even amongst large commercial enterprises such as Microsoft, provides greater documentation and uniformity.


A Brief Case Study: Accessing and Navigating Canada's Digital Collections, with an Emphasis on Tools for Born-Digital Sources

What can we do about this digital deluge? There are no simple answers, but historians must begin to conceptualize new additions to their traditional research and pedagogical toolkits. In the section that follows, I will do two things: introduce a case study, and introduce several tools, both proprietary and openly available off-the-shelf, to reveal one way forward.

For a case study, I have elected to use Canada's Digital Collections (CDC), a collection of 'dead' websites removed from their original location on the World Wide Web and now digitally archived at LAC. In this section, I will introduce the collection using digitally-archived sources from the early World Wide Web and discuss my research methodology, before illustrating how we can derive meaningful data from massive arrays of unstructured data.37

This is a case study of a web archive. Several of these methodologies could be applied to other formats, including large arrays of textual material, whether born-digital or not: a distant reading of this type is not medium specific. It does, however, assume that the text is high quality. Digitized primary documents do not always have text layers, and when these are generated computationally through Optical Character Recognition, errors can appear. Accuracy rates for digitized textual documents can range from as low as 40 percent for seventeenth and eighteenth century documents, to 80–90 percent for modern microfilmed newspapers, to over 99 percent for typeset, word-processed documents.38 That said, while historians today may be more interested in applying distant reading methodologies to more traditional primary source materials, it is my belief that in the not-so-distant future websites will be an increasingly significant source for social, cultural, and political historians.

What are Canada's Digital Collections? In 1996, Industry Canada, funded by the federal Youth Employment Strategy, facilitated the development of Canadian heritage websites by youth (defined as anyone between the ages of 15 and 30). It issued contracts to private corporations that agreed to use these individuals as web developers, for the twofold purpose of developing digital skills and instilling respect for and engagement with Canadian heritage and preservation.39 The collection, initially called SchoolNet but rebranded as Canada's Digital Collections in 1999, was accessible from the CDC homepage. After 2004, the program was shut down, and the vast majority of the websites were archived on LAC's webpage. This collection is now the largest born-digital Internet archive that LAC holds, worthy of study as content as well as a vehicle for thinking about methodological approaches.

Conventional records about the program are unavailable in the traditional LAC collection, unsurprising in light of both its recent creation and backlogs in processing and accessioning archival materials. We can instead use one element of our born-digital toolkit, the Internet Archive's WaybackMachine (its archive of preserved early webpages), to get a sense of what this project looked like in its heyday. The main webpage for the CDC was www.collections.ic.gc.ca. However, a visitor today will see nothing but a collection of broken image links (two of them will bring you to the English and French-language versions of the LAC-hosted web archive, although this is only discernible through trial and error or through the site's source code).

With the WaybackMachine, we have a powerful, albeit limited, opportunity to "go back in time" and experience the website as previous web viewers would have (if a preserved website is available, of course). Among the over 420 million archived websites, we can search for and find the original CDC site. Between 12 October 1999 and the most recent snapshot or web crawl (of the now-broken site) on 6 July 2011, the site was crawled 410 times.40 Unfortunately, every site version until the 15 October 2000 crawl has been somehow corrupted: content has instead been replaced by Chinese characters, which translate only into place names and other random words. Issues of early digital preservation, noted above, have reared their head in this collection.

From late 2000 onward, however, fully preserved CDC sites are accessible. Featured sites, new entries, success stories, awards, curricular units, and prominent invitations to apply for project funding populate the brightly coloured website. The excitement is palpable. John Manley, then Industry Minister, declared that "it is crucial that Canada's young people get the skills and experience they need to participate in the knowledge-based economy of the future," and that CDC "shows that young people facing significant employment challenges can contribute to and learn from the resources available on the Information Highway."41 The documentation, reflecting the lower level of technical expertise and scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as well as preservation issues, hardware requirements for computers (a computer with an Intel 386 processor, 8MB RAM, a 1GB hard drive, and a 28.8 kbps modem was recommended), and training information for employers, educators, and participants.42 Sample proposals and community liaison ideas, amidst other information, set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites is an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how viewers discovered the site and whether they think it is a good educational resource (the administrators must have been relieved that the majority thought so: in 2003, some 84 percent felt that it was a "good" resource, the best of its type).43 By 2004, the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, alphabetical listing, success stories, and detailed copyright information.

The Canada's Digital Collections site began to move towards removal in November 2006, when it initially disappeared completely. A visitor to the site would learn that the content "is no longer available," and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007 a notice appeared with a declaration that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained, then and today for that matter, is a straightforward alphabetical list and a disclaimer noting that "[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost."

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use to navigate large amounts of data. First, I create my own tools through computer programming; this work is showcased in the following section. I program in Wolfram Research's Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base, which complements exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website in perfectly readable plaintext (avoiding unicode errors), as follows: input = Import["http://activehistory.ca", "Plaintext"], and a second can make it all lower case for textual analysis, as such: lowerpage = ToLowerCase[input]. Variables are a critical building block: in the former example, input now holds the entire plaintext of the website ActiveHistory.ca, and lowerpage then holds it in lowercase. Once we have lower-cased information, we are quickly able to extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
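Building on those two lines, a few more can turn raw text into a ranked word list. This is a minimal sketch, under the assumption that simple tokenization on non-word characters is adequate for English-language pages:

    input = Import["http://activehistory.ca", "Plaintext"];
    lowerpage = ToLowerCase[input];
    words = StringSplit[lowerpage, Except[WordCharacter] ..];  (* tokenize on runs of non-word characters *)
    Take[SortBy[Tally[words], Last], -25]  (* the 25 most frequent words, with counts *)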

Even more promising is Mathematica's integration with the Wolfram|Alpha database (itself accessible at wolframalpha.com), a computational knowledge engine that promises to put incredible amounts of information at a user's fingertips. With a quick command, one can access the temperature at Toronto's Lester B. Pearson International Airport at 5:00 pm on 18 December 1983, the relative popularity of names such as "Edith" or "Emma," the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information or find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.
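Inside a notebook, such a lookup is a single line; for instance (a hypothetical query, whose output depends on a live connection to the Wolfram|Alpha service):

    WolframAlpha["Canadian unemployment rate March 1985", "Result"]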

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be all and end all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools.47 It is a powerful suite of textual analysis tools. Uploading one's own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a "word cloud" of a corpus; the ability to click on any word to see its relative rise and fall over time (word trends); access to the original text; and the ability to click on any word to see its contextual placement within individual sentences. It ably facilitates the "distant" reading of any reasonably-sized body of textual information and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. There are many visualization options provided: you can plot relationships between data, compare values (i.e. bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as to IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found within the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or, in a best case scenario, hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" that appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.
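To give a flavour of the topic-modeling step, here is roughly what a standalone MALLET run looks like from the command line (the directory and file names are illustrative, and the parameter values would need tuning for a real corpus):

    bin/mallet import-dir --input cdc-text/ --output cdc.mallet --keep-sequence --remove-stopwords
    bin/mallet train-topics --input cdc.mallet --num-topics 20 --output-topic-keys cdc-keys.txt

The resulting cdc-keys.txt lists, for each of the twenty topics, the words that most strongly define it.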

As befits the technical side of history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2 or on my own blog at ianmilligan.ca. I have elected to leave the step-by-step instructions online rather than provide full commands in this article: they require some technical knowledge, and working directly from the web is quicker than working from print. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian, dealing with conventional sources, often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command line entry brought the entire archive onto my system; it took days to process the information, and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay, or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations: the information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, with 360 MB of plaintext with HTML encoding, reduced to a 'mere' 50MB if all encoding is removed. A 50MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the Internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with the material. Before you do this, though, you need to be familiar with the website's "terms of service," the legal documentation that outlines how users are allowed to use a website, resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced in part or in whole and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so: the index is located at www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/AB/benevolat, the next in …/cdc/ab/nature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system, in this case one noting that you had seen the disclaimer about the archival nature of the website), and then used a one-line command to download the entire database.52 Due to the imposition of a random wait and bandwidth limit, which meant that I would download one file every half to one and a half seconds, at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.
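The full command appears in note 52; stripped of the cookie-saving preliminaries, its shape was roughly the following (a sketch rather than a recipe: the wait and bandwidth parameters should be adjusted to whatever the hosting archive asks of crawlers):

    wget -m --keep-session-cookies --load-cookies=cookie.txt --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp

Here -m turns on mirroring (recursion and time-stamping), while --wait, --random-wait, and --limit-rate slow the crawl down so as not to overload the server.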

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these connections must often be inferred; online, HTML born-digital sources are different in that the link offers a very concrete file.

[Figure: Visualizing the CDC Collection as a Series of Hyperlinks]

One quick visualization, the circular cluster seen above, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As we see in the figure, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The site is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and relationships between different sources. At a glimpse, we can see suggestions to follow other sites and determine if any nodes are shared amongst the various websites revealed.
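In outline, the code gathers each page's outbound links and renders them as a directed graph. A minimal sketch (seedPages is an assumed list of the downloaded pages' URLs; the dynamic labelling and the two-hop crawl of the real model are omitted):

    edges = Flatten[Map[
        Function[page, Thread[page -> Import[page, "Hyperlinks"]]],
        seedPages]];  (* one rule per hyperlink: source -> target *)
    Graph[DeleteDuplicates[edges]]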

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian primarily interested in the written word, my first step was to isolate the plaintext throughout the material, so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels: first, the aggregate collection, and second, the collection divided into each 'file,' or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words appear within the nearly eight million words of the collection? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader, and it is of minimal visual attractiveness. Two quick decisions were made to enhance data usability. First, the information was "case-folded," all sent to lower-case; this enables us to count the total appearances of a word such as "digital" regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency (as opposed to phrase frequency), "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is comprised entirely of stop words), but will quickly dominate any word-level visualization or list.53 Other words may need to be added to or removed from stop lists depending on the corpus: for example, if the phrase "page number" appeared on every page, that would be useless information for the researcher.
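In code, stop-word removal is a one-line filter. A sketch, with a deliberately abbreviated stop list (real lists run to hundreds of words) and words again holding the lower-cased tokens from the earlier example:

    stopwords = {"the", "and", "to", "of", "a", "in", "is", "for"};
    filtered = DeleteCases[words, Alternatives @@ stopwords];
    Take[SortBy[Tally[filtered], Last], -25]  (* top 25 content words *)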

The first step, then, is to use Mathematica to format word frequency information into a Wordle.net-readable format (if we put all of the text into that website, it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds.

[Figure: A Word Cloud of All Words in the CDC Corpus]

What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that it is dealing with Canada/Canadians, that it is a collection with documents in French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, a word cloud leaves a severe amount of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or to educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but they need to be used with extreme caution. For more scholarly applications, tables offer much more precise interpretation; as the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present": despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters (such as 'bi'), two syllables, or, pertinent in our case, two words; an example is "canada is." A trigram is three words, a quadgram four words, a fivegram five words, and so on.
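Extracting these from a token list is compact in Mathematica; a sketch, assuming words again holds a site's lower-cased tokens:

    bigrams = Partition[words, 2, 1];   (* overlapping two-word windows *)
    trigrams = Partition[words, 3, 1];
    Take[SortBy[Tally[trigrams], Last], -15]  (* fifteen most frequent trigrams *)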

[Figure: Word Cloud of the File "Past and Present"]

[Figure: A viewer for n-grams in the CDC Collection]

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguously named site could be about almost anything. Initially, I decided to study word frequency:

"alberta" 758, "listen" 490, "settlers" 467, "new" 454, "canada" 388, "land" 362, "read" 328, "people" 295, "settlement" 276, "chinese" 268, "canadian" 264, "place" 255, "heritage" 254, "edmonton" 250, "years" 238, "west" 236, "raymond" 218, "ukrainian" 216, "came" 213, "calgary" 209, "names" 206, "ranch" 204, "updated" 193, "history" 190, "communities" 183 …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example, in this automatically generated output of top phrases:

{{"first", "people", "settlers"}, 100}, {{"the", "united", "states"}, 72}, {{"one", "of", "the"}, 59}, {{"of", "the", "west"}, 58}, {{"settlers", "last", "updated"}, 56}, {{"people", "settlers", "last"}, 56}, {{"albertans", "first", "people"}, 56}, {{"adventurous", "albertans", "first"}, 56}, {{"the", "bar", "u"}, 55}, {{"opening", "of", "the"}, 54}, {{"communities", "adventurous", "albertans"}, 54}, {{"new", "communities", "adventurous"}, 54}, {{"bar", "u", "ranch"}, 53}, {{"listen", "to", "the"}, 52}, …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to quickly hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.
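
Such a viewer is straightforward to sketch with Mathematica's Manipulate; this hypothetical version assumes the words list from the sketch above and simply re-tallies n-grams as the sliders move:

    Manipulate[
     Grid[Take[Reverse@SortBy[Tally[Partition[words, n, 1]], Last], rows]],
     {{n, 2, "n-gram length"}, 2, 5, 1},    (* bigrams through fivegrams *)
     {{rows, 15, "rows shown"}, 5, 50, 5}]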

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stopword" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

{{"experience", "in", "the"}, 919}, {{"transferred", "to", "the"}, 918}, {{"funded", "by", "the"}, 918}, {{"of", "the", "provincial"}, 915}, {{"this", "web", "site"}, 915}, {{"409", "registry", "2"}, 914}, {{"are", "no", "longer"}, 914}, {{"gouvernement", "du", "canada"}, 910}, {{"use", "this", "link"}, 887}, {{"page", "or", "you"}, 887}, {{"following", "page", "or"}, 887}, {{"automatically", "transferred", "to"}, 887}, {{"be", "automatically", "transferred"}, 887}, {{"will", "be", "automatically"}, 887}, {{"industry", "you", "will"}, 887}, {{"multimedia", "industry", "you"}, 887}, {{"practical", "experience", "in"}, 887}, {{"gained", "practical", "experience"}, 887}, {{"they", "gained", "practical"}, 887}, {{"while", "they", "gained"}, 887}, {{"canadians", "while", "they"}, 887}, {{"young", "canadians", "while"}, 887}, {{"showcase", "of", "work"}, 887}, {{"excellent", "showcase", "of"}, 887}, …

Here we see an array of common language: funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we can move into the unique or very rare phrases that appear only once, twice, or three times.

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time. … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.
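
For those taking the command-line route, a minimal MALLET session looks roughly like the following; cdc-text is a hypothetical directory of plaintext files, and the parameter values are illustrative rather than the ones used in this study:

    # import a directory of plaintext files into MALLET's format
    bin/mallet import-dir --input cdc-text --output cdc.mallet \
        --keep-sequence --remove-stopwords
    # train a topic model and write out the top words per topic
    bin/mallet train-topics --input cdc.mallet --num-topics 20 \
        --output-topic-keys cdc-topic-keys.txt --output-doc-topics cdc-doc-topics.txt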


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works. It is a proprietary black box: while the original algorithm is available, the actual methods that go into determining the ranking of search results are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for the occurrence of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
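
A compressed sketch of that search-and-chart step (the folder cdc-text of per-site plaintext files is a hypothetical stand-in for the CDC frequency files):

    terms = {"freedom", "black", "liberty"};
    files = FileNames["*.txt", "cdc-text"];
    counts = Table[
       With[{text = ToLowerCase[Import[f, "Plaintext"]]},
        Table[StringCount[text, t], {t, terms}]], {f, files}];
    (* one bar group per term, one bar per site *)
    BarChart[Transpose[counts], ChartLegends -> terms]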

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.
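
A keyword-in-context routine is short enough to sketch in full; this hypothetical version simply pads each match with forty characters of context on either side:

    (* return every occurrence of term with pad characters of context *)
    kwic[text_String, term_String, pad_: 40] :=
      StringTake[text,
         {Max[First[#] - pad, 1], Min[Last[#] + pad, StringLength[text]]}] & /@
       StringPosition[text, term];
    kwic[ToLowerCase[Import["blackgold.txt", "Plaintext"]], "ontario"]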


A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly identify critical sources.

Another model through which we can visualize documents is through a viewer primarily focusing on term frequency and inverse document frequency, discussed below. To prepare this material, we combine our earlier data about word frequency (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e., the word "experience" appears 1,766 times). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf multiplied by the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."
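
As a worked example, with hypothetical counts rather than the CDC's actual figures: if "experience" appears 919 times in one site (tf = 919), and in 120 of the collection's 384 documents, then:

    tf = 919; nDocs = 384; docFreq = 120;
    N[tf*Log[nDocs/docFreq]]   (* 919 x log(384/120), approximately 1069 *)

A term that appears in every document has idf = log(1) = 0, which is precisely why collection-wide boilerplate washes out of these rankings.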


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires one to put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule, but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance rather than processed on the spot, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need arises, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin or jumpstart a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. These should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web, whether through blogging, micro-blogging, or publishing webpages; it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history and the readiness to work with born-digital sources cannot stem out of pedagogy alone, however. Institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary, and second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1. The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996, with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates/works on the 1960s now suggest, the past becomes history sooner than we often expect.

2. A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3. See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline/44 <viewed 18 January 2013>.

4. These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5. Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6. This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7. Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8. While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9. A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10. Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11. Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12. Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13. For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.


14. "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15. For an overview, see "Award Recipients — 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16. One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to learn the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17. Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18. Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19. Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20. The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook (Penguin, 2010). For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21. Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online, www.mashable.com/2010/03/09/cisco-crs-3.

22. A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes, a megabyte (MB) a million, a gigabyte (GB) a billion, a terabyte (TB) a trillion, and a petabyte (PB) a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23. These figures can of course vary depending on compression methods chosen. Their math was based on an average of 8MB per book, if it was scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24. Gleick, 232.
25. "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26. "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27. Jean-Baptiste Michel et al., 176–82, published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28. "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29. For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30. Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31. "Save LAC May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32. The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Center for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Center for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33. Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: Temple University Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34. For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35. Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36. Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37. Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38. See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39. Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40. "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online, www.wayback.archive.org/web/*/http://collections.ic.gc.ca <viewed 11 January 2013>.

41. "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online, www.web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42. Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collection Webpage, 24 November 2000, through Internet Archive, www.web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43. "Canada's Digital Collections Homepage," 21 November 2003, through Internet Archive, www.web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44. "Important News About Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive, www.web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45. For more information, see the information at "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46. It was 8 degrees Celsius at 5:00 pm on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and now "Emma" is the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47. It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.


48. For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49. The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50. For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51. "Important Notices," Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will have to increasingly learn to navigate the Terms of Service (TOS) of born-digital source repositories. They can usually be found on index or splash pages, under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52. The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previous cookie load, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and bandwidth cap, to be a good netizen.

53. For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8, available online at www.nlp.stanford.edu/IR-book.

54. David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55. The research was actually carried out on a rather dated late-2007 MacBook Pro.

56. See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57. "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58. "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59. Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?," Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.



This is to some degree comparing apples and oranges: the Library of Congress is predominantly print, whereas the Internet Archive has considerable multimedia holdings, which include videos, images, and audio files. The LOC is furthermore a more curated collection, whereas the Internet Archive draws from a wider range of producers. The Internet offers the advantages and disadvantages of being a more democratic archive. For example, a GeoCities site (an early way for anybody to create their own website) created by a 12-year-old Canadian in the mid-1990s might be preserved by the Internet Archive, whereas it almost certainly would not be saved by a national archive. This difference in size and collections management, however, is at the root of the changing historians' toolkit. If we make use of these large data troves, we can access a new and broad range of historical subjects.

Conventional sources are also part of this deluge. Google Books has been digitizing the printed word, and it currently has 15 million works completed; by 2020, it audaciously aims to digitize every book ever written. While copyright issues loom, the project will allow researchers to access aggregate word and phrase data, as seen in the Google n-gram project.27 In Canada, Library and Archives Canada (LAC) has a more circumscribed digitization project, collecting "a representative sample of Canadian websites," focusing particularly on Government of Canada webpages, as well as a curated approach preserving case studies such as the 2006 federal election. Currently, it has 4TB of federal government information and 7TB of total information in its larger web archive (a figure which, as I note below, is unlikely to drastically improve despite ostensible goals of archival 'modernization').28

In Canada, LAC has, on the surface, been advancing an ambitious digitization program, as it realizes the challenges now facing archivists and historians. Daniel J. Caron, the Librarian and Archivist of Canada, has been outspoken on this front, with several public addresses on the question of digital sources and archives. Throughout 2012, primary source digitization at LAC preoccupied critics, who saw it as a means to depreciate on-site access. The Canadian Association of University Teachers, the Canadian Historical Association, and other professional organizations mounted campaigns against this shift in resources.29 LAC does have a point, however, with its modernization agenda. The state of Canada's online collections is small and sorely lacking when compared to its on-site collections, and LAC does need to modernize and achieve the laudable goal of reaching audiences beyond Ottawa.

Nonetheless, digitization has unfortunately been used as a smokescreen to depreciate overall service offerings.30 The modernization agenda would also see 50 percent of the digitization staff cut.31 LAC's online collections are small; they do not work with online developers through Application Programming Interfaces (APIs, a way for a computer program to talk to another computer and speed up research; Canadiana.ca is incidentally a leader in this area); and there have been no promising indications that this state of affairs will change. Digitization has not been comprehensive. Online preservation is important, and historians must fight for both on-site and on-line access.

A number of historians have recognized this challenge and are playing an instrumental role in preserving the born-digital past. The Roy Rosenzweig Center for History and New Media at George Mason University launched several high-profile preservation projects: the September 11th Digital Archive, the Hurricane Digital Memory Bank (focusing on Hurricanes Katrina and Rita), and, as of writing, the Occupy archive.32 Massive amounts of online content are curated and preserved: photographs, news reports, blog posts, and now tweets. These complement more traditional efforts of collecting and preserving oral histories and personal recollections, which are then geo-tagged (tied to a particular geographical location), transcribed, and placed online. These archives serve a dual purpose: preserving the past for future generations, while also facilitating easy dissemination for today's audiences.

Digital preservation, however, is a topic of critical importance. Much of our early digital heritage has been lost. Many websites, especially those predating the Internet Archive's 1996 web archiving project launch, are completely gone. Early storage media (from tape to early floppy disks) have become obsolete and are now nearly inaccessible. Compounding this, software development has seen early proprietary file formats depreciated, ignored, and eventually made wholly inaccessible.33 Fortunately, the problem of digital preservation was recognized by the 1990s, and in 2000 the American government established the National Digital Information Infrastructure and Preservation Program.34 As early as 1998, digital preservationists argued: "[h]istorians will look back on this era … and see a period of very little information. A 'digital gap' will span from the beginning of the wide-spread use of the computer until the time we eventually solve this problem. What we're all trying to do is to shorten that gap." To bear home their successes, that quotation is taken from an Internet Archive snapshot of Wired magazine's website.35

Indeed, greater attention towards rigorous metadata (which is, briefly put, data about data, often preserved in plain text format), open-access file formats documented by the International Organization for Standardization (ISO), and dedicated digital preservation librarians and archivists gives us hope for the future.36

It is my belief that cloud storage, the online hosting of data in large third-party offsite centres, mitigates the issue of obsolete storage media; furthermore, the move towards open-source file formats, even amongst large commercial enterprises such as Microsoft, provides greater documentation and uniformity.


A Brief Case Study: Accessing and Navigating Canada's Digital Collections, with an Emphasis on Tools for Born-Digital Sources

What can we do about this digital deluge? There are no simple answers, but historians must begin to conceptualize new additions to their traditional research and pedagogical toolkits. In the section that follows, I will do two things: introduce a case study, and several tools, both proprietary and those that are open and available off-the-shelf, to reveal one way forward.

For a case study, I have elected to use Canada's Digital Collections (CDC), a collection of 'dead' websites removed from their original location on the World Wide Web and now digitally archived at LAC. In this section, I will introduce the collection using digitally-archived sources from the early World Wide Web, and discuss my research methodology, before illustrating how we can derive meaningful data from massive arrays of unstructured data.37

This is a case study of a web archive. Several of these methodologies could be applied to other formats, including large arrays of textual material, whether born-digital or not. A distant reading of this type is not medium specific. It does, however, assume that the text is high quality. Digitized primary documents do not always have text layers, and if they are instituted computationally through Optical Character Recognition, errors can appear. Accuracy rates for digitized textual documents can range from as low as 40 percent for seventeenth- and eighteenth-century documents, to 80–90 percent for modern microfilmed newspapers, to over 99 percent for typeset, word-processed documents.38 That said, while historians today may be more interested in applying distant reading methodologies to more traditional primary source materials, it is my belief that in the not-so-distant future, websites will be an increasingly significant source for social, cultural, and political historians.

What are Canada's Digital Collections? In 1996, Industry Canada, funded by the federal Youth Employment Strategy, facilitated the development of Canadian heritage websites by youth (defined as anyone between the ages of 15 and 30). It issued contracts to private corporations that agreed to use these individuals as web developers, for the twofold purpose of developing digital skills and instilling respect for and engagement with Canadian heritage and preservation.39 The collection, initially called SchoolNet but rebranded as Canada's Digital Collections in 1999, was accessible from the CDC homepage. After 2004, the program was shut down, and the vast majority of the websites were archived on LAC's webpage. This collection is now the largest born-digital Internet archive that LAC holds, worthy of study as content, as well as for thinking about methodological approaches.

Conventional records about the program are unavailable in the traditional LAC collection, unsurprising in light of both its recent creation and backlogs in processing and accessioning archival materials. We can use one element of our born-digital toolkit, the Internet Archive's WaybackMachine (its archive of preserved early webpages), to get a sense of what this project looked like in its heyday. The main webpage for the CDC was www.collections.ic.gc.ca. However, a visitor today will see nothing but a collection of broken image links (two of them will bring you to the English and French-language versions of the LAC-hosted web archive, although this is only discernible through trial and error, or through the site's source code).

With the WaybackMachine, we have a powerful, albeit limited, opportunity to "go back in time" and experience the website as previous web viewers would have (if a preserved website is available, of course). Among the over 420 million archived websites, we can search for and find the original CDC site. Between 12 October 1999 and the most recent snapshot, or web crawl (of the now-broken site), on 6 July 2011, the site was crawled 410 times.40 Unfortunately, every site version until the 15 October 2000 crawl has been somehow corrupted. Content has instead been replaced by Chinese characters, which translate only into place names and other random words. Issues of early digital preservation, noted above, have reared their head in this collection.

From late 2000 onward, however, fully preserved CDC sites are accessible. Featured sites, new entries, success stories, awards, curricular units, and prominent invitations to apply for project funding populate the brightly coloured website. The excitement is palpable. John Manley, then Industry Minister, declared: "it is crucial that Canada's young people get the skills and experience they need to participate in the knowledge-based economy of the future," and that CDC "shows that young people facing significant employment challenges can contribute to and learn from the resources available on the Information Highway."41 The documentation, reflecting the lower level of technical expertise and scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as well as preservation issues and hardware requirements for computers (a computer with an Intel 386 processor, 8MB RAM, a 1GB hard drive, and a 28.8 kbps modem was recommended), and training information for employers, educators, and participants.42 Sample proposals, community liaison ideas, amidst other information, set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites is an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how viewers discovered the site and whether they think it is a good educational resource (they must have been relieved that the majority thought so: in 2003, some 84 percent felt that it was a "good" resource, the best of its type).43 By 2004, the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, alphabetical listing, success stories, and detailed copyright information.

The Canada's Digital Collections site began to move towards removal in November 2006, and was initially completely gone. A visitor to the site would learn that the content "is no longer available," and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007 a notice appeared with a declaration that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained, then and today for that matter, is a straightforward alphabetical list and a disclaimer that noted: "[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost."

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use to navigate large amounts of data. First, I create my own tools through computer programming. This work is showcased in the following section. I program in Wolfram Research's Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base, which complements exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website in perfectly readable plaintext (avoiding unicode errors), as follows: input = Import["http://activehistory.ca", "Plaintext"], and a second can make it all lower case for textual analysis, as such: lowerpage = ToLowerCase[input]. Variables are a critical building block. In the former example, "input" now holds the entire plaintext of the website ActiveHistory.ca, and "lowerpage" then holds it in lowercase. Once we have lower-cased information, we are quickly able to extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
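
Continuing that example as a hedged sketch, a tally of the most frequent words on the page takes only two further lines (stop-word removal omitted for brevity):

    words = StringSplit[lowerpage, Except[WordCharacter] ..];
    Take[Reverse@SortBy[Tally[words], Last], 20]  (* the twenty most frequent words *)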

Even more promising is Mathematica's integration with the Wolfram|Alpha database (itself accessible at wolframalpha.com), a computational knowledge engine that promises to put incredible amounts of information at a user's fingertips. With a quick command, one can access the temperature at Toronto's Lester B. Pearson International Airport at 5:00 pm on 18 December 1983, the relative popularity of names such as "Edith" or "Emma," the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information, or find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.
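
Inside a notebook, such queries are one-liners; an internet connection is assumed, and the query phrasings below are merely illustrative:

    WolframAlpha["Canada unemployment rate March 1985"]
    WolframAlpha["temperature Toronto Pearson airport 5pm 18 December 1983"]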

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be-all and end-all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools.47 It is a powerful suite of textual analysis tools. Uploading one's own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a "word cloud" of a corpus; the ability to click on any word to see its relative rise and fall over time (word trends); access to the original text; and the ability to click on any word to see its contextual placement within individual sentences. It ably facilitates the "distance" reading of any reasonably-sized body of textual information, and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. Many visualization options are provided: you can plot relationships between data, compare values (i.e., bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or, in a best case scenario, hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow, mentioned in this article, is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" that appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus. For example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era, and learn what makes 1968 most unique.50 There are also sentiment analyses, which attempt to plot changes in emotional words and phrases over time, and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.
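
For intuition, Dunning's statistic for a single word reduces to a few lines of Mathematica; the counts below are invented for illustration (a word appearing 150 times in a 50,000-word corpus against 400 times in a 1,000,000-word reference corpus):

    (* observed counts a, b against corpus sizes c, d; e1, e2 are expectations *)
    dunning[a_, c_, b_, d_] :=
      Module[{e1 = c (a + b)/(c + d), e2 = d (a + b)/(c + d)},
       2. (a Log[a/e1] + b Log[b/e2])]
    dunning[150, 50000, 400, 1000000]  (* roughly 308: heavily overrepresented *)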

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2, or on my own blog at ianmilligan.ca. I have elected to leave them online rather than provide commands in this article; they require some technical knowledge, and working directly from the web is quicker than here. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian, dealing with conventional sources, often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command line entry brought the entire archive onto my system; it took days to process the information, and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations. The information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, with 360 MB of plaintext with HTML encoding, reduced to a 'mere' 50 MB if all encoding is removed. A 50 MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with material. Before you do this, though, you need to be familiar with the website's "terms of service," or the legal documentation that outlines how users are allowed to use a website, resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced in part or in whole and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so: the index is located at www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/AB/benevolat, the next in …cdc/ab/nature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system, in this case one that noted that you had seen the disclaimer about the archival nature of the website), and then used a one-line command in my command line to download the entire database.52 Due to the imposition of a random wait and bandwidth limit, which meant that I would download one file every half to one and a half seconds, at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these must often be inferred; online, HTML born-digital sources are different in that the link offers a very concrete file.

[Figure: Visualizing the CDC Collection as a Series of Hyperlinks]

One quick visualization, the circular cluster seen above, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As we see in the figure, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The site is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and relationships between different sources. At a glimpse, we can see suggestions to follow other sites and determine if any nodes are shared amongst the various websites revealed.
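Mathematica's Import function can retrieve a page's outbound links directly, which keeps such a crawler short. A minimal sketch of the first level is below; the seed URL is the collection's index page, and the full model simply repeats the harvesting step for each gathered link to go two levels deep:

  seed = "http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp";
  links = Import[seed, "Hyperlinks"]; (* every outbound hyperlink on the index page *)
  edges = Thread[seed -> links]; (* one directed edge per link *)
  GraphPlot[edges, VertexLabeling -> Tooltip] (* mouse over a node to reveal its URL *)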

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian who is primarily interested in the written word, my first step was to isolate the plaintext throughout the material, so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels: first, the aggregate collection; and second, the collection divided into each 'file,' or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words within the nearly eight million words of the collection appear? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader and is of minimal visual attractiveness. Two quick decisions were made to enhance data usability: first, the information was "case-folded," all sent to lower-case; this enables us to count the total appearances of a word such as "digital" regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency, as opposed to phrase frequency, "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is comprised entirely of stop words), but will quickly dominate any word-level visualization or list.53 Other words may need to be added or removed from stop lists depending on the corpus: for example, if the phrase "page number" appeared on every page, that would be useless information for the researcher.
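Both steps are compact in Mathematica. In the sketch below, plaintext is assumed to already hold the collection's text, and the stop list is abbreviated for illustration; working stop lists run to hundreds of entries, in both English and French:

  stopwords = {"the", "and", "to", "of", "a", "in", "is", "de", "la", "le"}; (* abbreviated for illustration *)
  words = StringSplit[ToLowerCase[plaintext], Except[WordCharacter] ..]; (* case-fold, then split on non-word characters *)
  filtered = Select[words, ! MemberQ[stopwords, #] &]; (* drop stop words for word-level counts *)
  wordFreqs = Reverse[SortBy[Tally[filtered], Last]]; (* {word, count} pairs, most frequent first *)
  TableForm[Take[wordFreqs, 25]]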

The first step, then, is to use Mathematica to format word frequency information into a Wordle.net-readable format (if we put all of the text into that website, it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds.

[Figure: A Word Cloud of All Words in the CDC Corpus]

What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that it is dealing with Canada/Canadians; that it is a collection with documents in French and English; that we are dealing with digitized collections; and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.
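The Wordle reshaping step mentioned above is nearly a one-liner. Wordle's advanced mode accepted plain word:weight lines, so something like the following, reusing the ranked wordFreqs list from the earlier sketch, produces an upload small enough for the site to handle:

  wordleLines = #[[1]] <> ":" <> ToString[#[[2]]] & /@ Take[wordFreqs, 150]; (* cap at 150 words so the site does not choke *)
  Export["cdc-wordle.txt", StringJoin[Riffle[wordleLines, "\n"]], "Text"]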

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, a word cloud leaves a severe amount of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but need to be used with extreme caution. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present." Despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters, such as 'bi'; two syllables; or, pertinent in our case, two words. An example: "canada is." A trigram is three words, a quadgram is four words, a fivegram is five words, and so on.
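Deriving word-level n-grams is mechanical once a text is tokenized: partition the word list into overlapping windows of length n and tally the windows. A minimal sketch, reusing the unfiltered words list from above (stop words stay in for phrases):

  ngrams[wordList_List, n_Integer] := Reverse[SortBy[Tally[Partition[wordList, n, 1]], Last]]; (* overlapping n-word windows, ranked by frequency *)
  bigrams = ngrams[words, 2];
  trigrams = ngrams[words, 3];
  First[trigrams] (* the most frequent three-word phrase and its count *)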


[Figure: Word Cloud of the File "Past and Present"]

[Figure: A viewer for n-grams in the CDC Collection]

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

"alberta" 758, "listen" 490, "settlers" 467, "new" 454, "canada" 388, "land" 362, "read" 328, "people" 295, "settlement" 276, "chinese" 268, "canadian" 264, "place" 255, "heritage" 254, "edmonton" 250, "years" 238, "west" 236, "raymond" 218, "ukrainian" 216, "came" 213, "calgary" 209, "names" 206, "ranch" 204, "updated" 193, "history" 190, "communities" 183 …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example in this automatically generated output of top phrases:

"first" "people" "settlers" 100, "the" "united" "states" 72, "one" "of" "the" 59, "of" "the" "west" 58, "settlers" "last" "updated" 56, "people" "settlers" "last" 56, "albertans" "first" "people" 56, "adventurous" "albertans" "first" 56, "the" "bar" "u" 55, "opening" "of" "the" 54, "communities" "adventurous" "albertans" 54, "new" "communities" "adventurous" 54, "bar" "u" "ranch" 53, "listen" "to" "the" 52 …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more information or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to quickly hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.
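A toy version of such a viewer is a single Manipulate expression, reusing the ngrams helper sketched earlier; dragging the slider toggles between bigrams, trigrams, and quadgrams:

  Manipulate[TableForm[Take[ngrams[words, n], 15]], {n, 2, 4, 1}] (* n = 2, 3, or 4 *)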

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stop-word" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

"experience" "in" "the" 919, "transferred" "to" "the" 918, "funded" "by" "the" 918, "of" "the" "provincial" 915, "this" "web" "site" 915, "409" "registry" "2" 914, "are" "no" "longer" 914, "gouvernement" "du" "canada" 910, "use" "this" "link" 887, "page" "or" "you" 887, "following" "page" "or" 887, "automatically" "transferred" "to" 887, "be" "automatically" "transferred" 887, "will" "be" "automatically" 887, "industry" "you" "will" 887, "multimedia" "industry" "you" 887, "practical" "experience" "in" 887, "gained" "practical" "experience" 887, "they" "gained" "practical" 887, "while" "they" "gained" 887, "canadians" "while" "they" 887, "young" "canadians" "while" 887, "showcase" "of" "work" 887, "excellent" "showcase" "of" 887 …

Here we see an array of common language: funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we can move into the unique or very rare phrases that appear only once, twice, or three times.
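Both counting strategies described above are easy to express. The sketch below assumes siteWords is a list of tokenized word lists, one per website (384 of them, in the example above):

  aggregateCount[term_String] := Total[Count[#, term] & /@ siteWords]; (* total appearances across the whole collection *)
  docFrequency[term_String] := Count[siteWords, s_ /; MemberQ[s, term]]; (* number of sites containing the term at all *)
  relativeFrequency[term_String] := N[docFrequency[term]/Length[siteWords]];
  relativeFrequency["aboriginal"] (* 45 of 384 sites would yield roughly 0.117 *)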

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time. … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable, and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line: in essence, creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works. It is a proprietary black box. While the original algorithm is available, the actual methods that go into determining the ranking of search terms are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for the occurrence of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge: the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above, we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
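A sketch of the search-and-chart step, assuming the siteWords lists from above plus a parallel siteNames list of file labels (both assumptions); Tooltip supplies the mouse-over frequency display:

  searchTerms = {"freedom", "black", "liberty"};
  termCounts[term_String] := Count[#, term] & /@ siteWords; (* occurrences of the term in each site *)
  bars = Table[MapThread[Tooltip, {termCounts[term], siteNames}], {term, searchTerms}];
  BarChart[bars, ChartLegends -> searchTerms] (* one bar per site, grouped by search term *)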

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.


[Figure: A snapshot of the dynamic finding aid tool]

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly realize critical sources.
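A keyword-in-context routine itself takes only a few lines. A minimal sketch over one tokenized site (the index 47, standing in for the "blackgold.txt" file, is illustrative):

  kwic[wordList_List, term_String, window_: 5] :=
    Table[StringJoin[Riffle[wordList[[Max[1, p - window] ;; Min[Length[wordList], p + window]]], " "]],
      {p, Flatten[Position[wordList, term]]}]; (* each hit with five words of context on either side *)
  kwic[siteWords[[47]], "ontario"] // TableForm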

Another model through which we can visualize documents is through a viewer primarily focusing on term document frequency and inverse frequency terms, discussed below. To prepare this material, we are combining our earlier data about word frequency (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e., the word "experience" appears 1,766 times). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf multiplied by the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.
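To make the arithmetic concrete, a minimal sketch: tf is the term's count within one site, nDocs the number of sites, and df the number of sites containing the term (the docFrequency helper above). The df value used here is hypothetical:

  tfidf[tf_, nDocs_, df_] := tf*Log[nDocs/df]; (* large when a term is common locally but rare corpus-wide; Log is the natural logarithm *)
  tfidf[919, 384, 12] // N (* e.g., "experience" appearing 919 times in one site but in only 12 of 384 sites; the df of 12 is hypothetical *)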

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5 MB, and Voyant has no hard and fast rule, but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance, rather than processed on-site, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need rears itself, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin or jumpstart a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. Such courses should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources: notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web, whether through blogging, micro-blogging, publishing webpages, and so forth: it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history and the readiness to work with born-digital sources cannot stem out of pedagogy alone, however. Institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary and, second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates/works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.


14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients – 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to learn the basics of computational history. See The Programming Historian 2, 2012–present, programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ontario, 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook (Penguin, 2010). For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2, The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, a Theory, a Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online: mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes, a megabyte (MB) a million, a gigabyte (GB) a billion, a terabyte (TB) a trillion, and a petabyte (PB) a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can, of course, vary depending on compression methods chosen. Their math was based on an average of 8 MB per book, if it was scanned at 600 dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10000000000000000 Bytes Archived," Internet Archive Blog, 26 October 2012, blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82, published online ahead of print 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Center for History and New Media and American Social History Project/Center for Media and Learning, 911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Center for History and New Media, occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online: wayback.archive.org/web/*/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online: web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collection Webpage, 24 November 2000, through Internet Archive: web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections Homepage," 21 November 2003, through Internet Archive: web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important News about Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive: web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 pm on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and "Emma" is now the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.


48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 "Important Notices," Library and Archives Canada Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. They can usually be found on index or splash pages under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded previously saved cookies, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and bandwidth cap, to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8, available online at nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated late-2007 MacBook Pro.

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.


preserving case studies such as the 2006 federal election. Currently it has 4 TB of federal government information, and 7 TB of total information in its larger web archive (a figure which, as I note below, is unlikely to drastically improve despite ostensible goals of archival 'modernization').28

In Canada, LAC has, on the surface, been advancing an ambitious digitization program, as it realizes the challenges now facing archivists and historians. Daniel J. Caron, the Librarian and Archivist of Canada, has been outspoken on this front, with several public addresses on the question of digital sources and archives. Throughout 2012, primary source digitization at LAC preoccupied critics, who saw it as a means to depreciate on-site access. The Canadian Association of University Teachers, the Canadian Historical Association, as well as other professional organizations, mounted campaigns against this shift in resources.29 LAC does have a point, however, with their modernization agenda. The state of Canada's online collections is small and sorely lacking when compared to their on-site collections, and LAC does need to modernize and achieve the laudable goal of reaching audiences beyond Ottawa.

Nonetheless, digitization has unfortunately been used as a smokescreen to depreciate overall service offerings.30 The modernization agenda would also see 50 percent of the digitization staff cut.31 LAC's online collections are small; they do not work with online developers through Application Programming Interfaces (APIs, a way for a computer program to talk to another computer and speed up research; Canadiana.ca is, incidentally, a leader in this area); and there have been no promising indications that this state of affairs will change. Digitization has not been comprehensive. Online preservation is important, and historians must fight for both on-site and on-line access.

A number of historians have recognized this challenge and are playing an instrumental role in preserving the born-digital past. The Roy Rosenzweig Center for History and New Media at George Mason University launched several high-profile preservation projects: the September 11th Digital Archive, the Hurricane Digital Memory Bank (focusing on Hurricanes Katrina and Rita), and, as of writing, the Occupy archive.32 Massive amounts of online content are curated and preserved: photographs, news reports, blog posts, and now tweets. These complement more traditional efforts of collecting and preserving oral histories and personal recollections, which are then geo-tagged (tied to a particular geographical location), transcribed, and placed online. These archives serve a dual purpose: preserving the past for future generations while also facilitating easy dissemination for today's audiences.

Digital preservation, however, is a topic of critical importance. Much of our early digital heritage has been lost. Many websites, especially those predating the Internet Archive's 1996 web archiving project launch, are completely gone. Early storage media (from tape to early floppy disks) have become obsolete and are now nearly inaccessible. Compounding this, software development has seen early proprietary file formats depreciated, ignored, and eventually made wholly inaccessible.33 Fortunately, the problem of digital preservation was recognized by the 1990s, and in 2000 the American government established the National Digital Information Infrastructure and Preservation Program.34 As early as 1998, digital preservationists argued: "[h]istorians will look back on this era … and see a period of very little information. A 'digital gap' will span from the beginning of the wide-spread use of the computer until the time we eventually solve this problem. What we're all trying to do is to shorten that gap." To bear home their successes, that quotation is taken from an Internet Archive snapshot of Wired magazine's website.35

Indeed, greater attention towards rigorous metadata (which is, briefly put, data about data, often preserved in plain text format), open-access file formats documented by the International Organization for Standardization (ISO), and dedicated digital preservation librarians and archivists gives us hope for the future.36

It is my belief that cloud storage, the online hosting of data in large third-party offsite centres, mitigates the issue of obsolete storage media, and, furthermore, that the move towards open source file formats, even amongst large commercial enterprises such as Microsoft, provides greater documentation and uniformity.


A Brief Case Study: Accessing and Navigating Canada's Digital Collections, With an Emphasis on Tools for Born-Digital Sources

What can we do about this digital deluge? There are no simple answers, but historians must begin to conceptualize new additions to their traditional research and pedagogical toolkits. In the section that follows, I will do two things: introduce a case study, and introduce several tools, both proprietary and those that are open and available off-the-shelf, to reveal one way forward.

For a case study, I have elected to use Canada's Digital Collections (CDC), a collection of 'dead' websites removed from their original location on the World Wide Web and now digitally archived at LAC. In this section, I will introduce the collection, using digitally-archived sources from the early World Wide Web, and discuss my research methodology, before illustrating how we can derive meaningful data from massive arrays of unstructured data.37

This is a case study of a web archive. Several of these methodologies could be applied to other formats, including large arrays of textual material, whether born-digital or not: a distant reading of this type is not medium specific. It does, however, assume that the text is high quality. Digitized primary documents do not always have text layers, and if they are instituted computationally through Optical Character Recognition, errors can appear. Accuracy rates for digitized textual documents can range from as low as 40 percent for seventeenth- and eighteenth-century documents, to 80–90 percent for modern microfilmed newspapers, to over 99 percent for typeset, word-processed documents.38 That said, while historians today may be more interested in applying distant reading methodologies to more traditional primary source materials, it is my belief that in the not-so-distant future, websites will be an increasingly significant source for social, cultural, and political historians.

What are Canada's Digital Collections? In 1996, Industry Canada, funded by the federal Youth Employment Strategy, facilitated the development of Canadian heritage websites by youth (defined as anyone between the ages of 15 and 30). It issued contracts to private corporations that agreed to use these individuals as web developers, for the twofold purpose of developing digital skills and instilling respect for and engagement with Canadian heritage and preservation.39 The collection, initially called SchoolNet but rebranded as Canada's Digital Collections in 1999, was accessible from the CDC homepage. After 2004, the program was shut down and the vast majority of the websites were archived on LAC's webpage. This collection is now the largest born-digital Internet archive that LAC holds, worthy of study as content, as well as for thinking about methodological approaches.

Conventional records about the program are unavailable in the traditional LAC collection, unsurprising in light of both its recent creation and backlogs in processing and accessioning archival materials. We can use one element of our born-digital toolkit, the Internet Archive's WaybackMachine (its archive of preserved early webpages), to get a sense of what this project looked like in its heyday. The main webpage for the CDC was www.collections.ic.gc.ca. However, a visitor today will see nothing but a collection of broken image links (two of them will bring you to the English and French-language versions of the LAC-hosted web archive, although this is only discernible through trial and error or through the site's source code).

With the WaybackMachine, we have a powerful, albeit limited, opportunity to "go back in time" and experience the website as previous web viewers would have (if a preserved website is available, of course). Among the over 420 million archived websites, we can search for and find the original CDC site. Between 12 October 1999 and the most recent snapshot, or web crawl, of the now-broken site on 6 July 2011, the site was crawled 410 times.40 Unfortunately, every site version until the 15 October 2000 crawl has been somehow corrupted: content has instead been replaced by Chinese characters, which translate only into place names and other random words. Issues of early digital preservation, noted above, have reared their head in this collection.

From late 2000 onward, however, fully preserved CDC sites are accessible. Featured sites, new entries, success stories, awards, curricular units, and prominent invitations to apply for project funding populate the brightly coloured website. The excitement is palpable. John Manley, then Industry Minister, declared that "it is crucial that Canada's young people get the skills and experience they need to participate in the knowledge-based economy of the future," and that the CDC "shows that young people facing significant employment challenges can contribute to and learn from the resources available on the Information Highway."41 The documentation, reflecting the lower level of technical expertise and scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as well as preservation issues, hardware requirements for computers (a computer with an Intel 386 processor, 8MB RAM, a 1GB hard drive, and a 28.8 kbps modem was recommended), and training information for employers, educators, and participants.42 Sample proposals and community liaison ideas, amidst other information, set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites offers an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how viewers discovered the site and whether they think it is a good educational resource (the administrators must have been relieved that the majority thought so: in 2003, some 84 percent felt that it was a "good" resource, the best of its type).43 By 2004, the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, alphabetical listing, success stories, and detailed copyright information.

The Canada's Digital Collections site began to move towards removal in November 2006, and initially the content was completely gone. A visitor to the site would learn that the content "is no longer available," and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007, a notice appeared with a declaration that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained then (and today, for that matter) is a straightforward alphabetical list and a disclaimer that noted: "[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost."

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use to navigate large amounts of data. First, I create my own tools through computer programming; this work is showcased in the following section. I program in Wolfram Research's Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base that complements exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website as perfectly readable plaintext (avoiding unicode errors), as follows: input = Import["http://activehistory.ca", "Plaintext"], and a second can fold it all into lower case for textual analysis, as such: lowerpage = ToLowerCase[input]. Variables are a critical building block: in the former example, input now holds the entire plaintext of the website ActiveHistory.ca, and lowerpage then holds it in lower case. Once we have lower-cased information, we can quickly extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, generate word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
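To give a sense of the workflow, a minimal sketch (the variable names and the truncated output step are mine, for illustration only) runs as follows:

    input = Import["http://activehistory.ca", "Plaintext"];
    lowerpage = ToLowerCase[input];
    words = StringSplit[lowerpage, Except[WordCharacter] ..];  (* split into words *)
    Take[Reverse[SortBy[Tally[words], Last]], 25]              (* 25 most frequent words *)

The last line tallies every distinct word and sorts by count; stop words, discussed below, still dominate the list at this stage.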

Even more promising is Mathematica's integration with the Wolfram|Alpha database (itself accessible at wolframalpha.com), a computational knowledge engine that promises to put incredible amounts of information at a user's fingertips. With a quick command, one can access the temperature at Toronto's Lester B. Pearson International Airport at 5:00 p.m. on 18 December 1983, the relative popularity of names such as "Edith" or "Emma," the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information, or find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.
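From within Mathematica, such lookups are single function calls; a sketch, with illustrative query strings (the exact phrasings Wolfram|Alpha accepts may vary, and an Internet connection is required):

    WolframAlpha["Canadian unemployment rate March 1985", "Result"]
    WolframAlpha["Canadian GDP December 1995", "Result"]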

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be-all and end-all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools.47 It is a powerful suite of textual analysis tools. Uploading one's own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a "word cloud" of a corpus; the ability to click on any word to see its relative rise and fall over time (word trends); access to the original text; and the ability to click on any word to see its contextual placement within individual sentences. It ably facilitates the "distant" reading of any reasonably-sized body of textual information and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. There are many visualization options provided: you can plot relationships between data, compare values (i.e., bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as to IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or (in a best-case scenario) hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" that appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2, or on my own blog at ianmilligan.ca. I have elected to leave the full tutorials online rather than reproduce every command in this article: they require some technical knowledge, and working directly from the web is quicker than from the printed page. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian, dealing with conventional sources, often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command-line entry brought the entire archive onto my system; it took days to process the information, and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay, or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations. The information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, with 360 MB of plaintext with HTML encoding, reduced to a 'mere' 50MB if all encoding is removed. A 50MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the Internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with the material. Before you do this, though, you need to be familiar with the website's "terms of service," or the legal documentation that outlines how users are allowed to use a website, resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced in part or in whole and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so: the index is located at www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/AB/benevolat, the next in …/cdc/ab/nature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system, in this case one that noted that you had seen the disclaimer about the archival nature of the website), and then used a one-line command in my command line to download the entire database.52 Due to the imposition of a random wait and bandwidth limit, which meant that I would download one file every half to one and a half seconds at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these connections must often be inferred; HTML born-digital sources are different in that the link offers a very concrete file.

Visualizing the CDC Collection as a Series of Hyperlinks

One quick visualization, the circular cluster seen above, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As we see in the figure, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The site is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and the relationships between different sources. At a glimpse, we can see suggestions to follow other sites, and determine if any nodes are shared amongst the various websites revealed.
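A minimal sketch of how such a link graph can be assembled, limited here to a single level of depth (the starting URL is the archive index discussed earlier):

    start = "http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp";
    links = Import[start, "Hyperlinks"];           (* scrape every outbound link *)
    Graph[Map[start -> # &, links],
      VertexLabels -> Placed["Name", Tooltip]]     (* hover a node to reveal its URL *)

Repeating the Import for each scraped link builds the two-level web pictured in the figure.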

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian who is primarily interested in the written word, my first step was to isolate the plaintext throughout the material so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels: first, the aggregate collection, and second, dividing the collection into each 'file' or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes, and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words within the nearly eight million words of the collection appear? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader and is of minimal visual attractiveness. Two quick decisions were made to enhance data usability: first, the information was "case-folded," all sent to lower case; this enables us to count the total appearances of a word such as "digital" regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency (as opposed to phrase frequency), "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is comprised entirely of stop words), but will quickly dominate any word-level visualization or list.53 Other words may need to be added to or removed from stop lists depending on the corpus: for example, if the phrase "page number" appeared on every page, that would be useless information for the researcher.
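Continuing the earlier sketch, stop-word removal is a simple filter; the stop list below is deliberately truncated for illustration:

    stopwords = {"the", "and", "to", "of", "a", "in", "that", "is"};
    filtered = DeleteCases[words, Alternatives @@ stopwords];
    Take[Reverse[SortBy[Tally[filtered], Last]], 25]   (* top words, stop words removed *)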

The first step, then, is to use Mathematica to format word frequency information into a Wordle.net-readable format (if we put all of the text into that website it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds.

A Word Cloud of All Words in the CDC Corpus

What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that the collection is dealing with Canada and Canadians, that it is a collection with documents in French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, they leave a severe amount of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but need to be used with extreme caution. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present": despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters (such as 'bi'), two syllables, or, pertinent in our case, two words: an example is "canada is." A trigram is three words, a quadgram is four words, a fivegram is five words, and so on.
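Once a text is split into words, n-grams fall out of a single partitioning step; a sketch, continuing with the words list from the earlier examples:

    trigrams = Partition[words, 3, 1];                 (* overlapping three-word windows *)
    Take[Reverse[SortBy[Tally[trigrams], Last]], 15]   (* 15 most frequent trigrams *)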


Word Cloud of the File "Past and Present"

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

"alberta" 758, "listen" 490, "settlers" 467, "new" 454, "canada" 388, "land" 362, "read" 328, "people" 295, "settlement" 276, "chinese" 268, "canadian" 264, "place" 255, "heritage" 254, "edmonton" 250, "years" 238, "west" 236, "raymond" 218, "ukrainian" 216, "came" 213, "calgary" 209, "names" 206, "ranch" 204, "updated" 193, "history" 190, "communities" 183 …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example in this automatically generated output of top phrases:

"first people settlers" 100, "the united states" 72, "one of the" 59, "of the west" 58, "settlers last updated" 56, "people settlers last" 56, "albertans first people" 56, "adventurous albertans first" 56, "the bar u" 55, "opening of the" 54, "communities adventurous albertans" 54, "new communities adventurous" 54, "bar u ranch" 53, "listen to the" 52 …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to quickly hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stop-word" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

"experience in the" 919, "transferred to the" 918, "funded by the" 918, "of the provincial" 915, "this web site" 915, "409 registry 2" 914, "are no longer" 914, "gouvernement du canada" 910, "use this link" 887, "page or you" 887, "following page or" 887, "automatically transferred to" 887, "be automatically transferred" 887, "will be automatically" 887, "industry you will" 887, "multimedia industry you" 887, "practical experience in" 887, "gained practical experience" 887, "they gained practical" 887, "while they gained" 887, "canadians while they" 887, "young canadians while" 887, "showcase of work" 887, "excellent showcase of" 887 …

Here we see an array of common language: funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we can move into the unique or very rare phrases that appear only once, twice, or three times.
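Both counting approaches described above take only a few lines of code; a sketch, assuming siteWordLists holds one word list per website (an illustrative name, not a structure taken from the collection itself):

    aggregate = Total[Map[Count[#, "aboriginal"] &, siteWordLists]];
    relative = N[Count[siteWordLists, _?(MemberQ[#, "aboriginal"] &)] /
        Length[siteWordLists]]                (* share of sites containing the term *)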

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a computer science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable, and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta towns, bridges, lakes); the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language); the stories (story, knight, archives, but also, interestingly, knight and wife); and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.
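For readers who want to try MALLET directly from the command line, an invocation of its topic trainer looks something like the following (the directory and file names here are illustrative, and the options shown are only a small subset of those available):

    bin/mallet import-dir --input cdc-plaintext/ --output cdc.mallet \
        --keep-sequence --remove-stopwords
    bin/mallet train-topics --input cdc.mallet --num-topics 20 \
        --output-topic-keys topic-keys.txt --output-doc-topics doc-topics.txt

The topic-keys file lists the defining words of each inferred topic, much like the word clouds described above.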

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line: in essence, creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works: it is a proprietary black box. While the original algorithm is available, the actual methods that go into determining the ranking of search terms are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge: the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above, we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
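A sketch of the underlying search, again using the illustrative siteWordLists from above: count a term per site, then chart the counts with tooltips supplying the exact figures:

    term = "freedom";
    counts = Map[Count[#, term] &, siteWordLists];
    BarChart[Map[Tooltip[#, Row[{term, ": ", #}]] &, counts]]  (* hover for term and count *)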

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.
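A keyword-in-context routine needs little more than a string search plus a window of surrounding characters; a sketch (the file name follows the example above, and the 40-character window is an arbitrary choice):

    text = ToLowerCase[Import["blackgold.txt", "Text"]];
    positions = StringPosition[text, "ontario"];
    Column[Map[StringTake[text,
        {Max[1, #[[1]] - 40], Min[StringLength[text], #[[2]] + 40]}] &,
      positions]]                                  (* each hit with its context *)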


A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly locate critical sources.

Another model through which we can visualize documents is a viewer primarily focusing on term frequency and inverse document frequency, terms discussed below. To prepare this material, we combine our earlier data about word frequency (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e., that the word "experience" appears 1,766 times). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf multiplied by the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."
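As a formula, this is tf-idf = tf × log(N/df), where N is the number of documents and df the number containing the term. In code it is a one-line definition (illustrative names again, and assuming the term appears somewhere in the corpus so the logarithm is defined):

    tfidf[term_, doc_, corpus_] := Count[doc, term] *
      Log[Length[corpus] / Count[corpus, _?(MemberQ[#, term] &)]]

    tfidf["experience", First[siteWordLists], siteWordLists]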


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance, rather than processed on-site, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need rears itself, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin or jumpstart a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. Such courses need not be mandatory, but could instead be taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software and databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web, whether through blogging, micro-blogging, publishing webpages, and so forth: it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history, and the readiness to work with born-digital sources, cannot stem from pedagogy alone, however. Institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article has argued, first, why digital methodologies are necessary, and second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is the founder and co-editor of the ActiveHistory.ca website, as well as an editor of the Programming Historian 2 website. He has written several articles on Canadian youth, digitization, and labour history.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996: Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline/44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and the structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.


14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients – 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to teach the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook (Penguin, 2010). For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2, The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online: www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can, of course, vary depending on the compression methods chosen. Their math was based on an average of 8MB per book, if it was scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10000000000000000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82; published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Center for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Center for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, www.webarchive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians, but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in ProQuest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online: www.waybackarchive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online: www.webarchive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collections Webpage, 24 November 2000, through Internet Archive, www.webarchive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections Homepage," 21 November 2003, through Internet Archive, www.webarchive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important News about Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive, www.webarchive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 p.m. on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and "Emma" is now the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.


48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 "Important Notices," Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will have to increasingly learn to navigate the Terms of Service (TOS) of born-digital source repositories. They can usually be found on index or splash pages, under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and bandwidth cap, to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated, late-2007 MacBook Pro.

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>; and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.

is curated and preserved: photographs, news reports, blog posts, and now tweets. These complement more traditional efforts of collecting and preserving oral histories and personal recollections, which are then geo-tagged (tied to a particular geographical location), transcribed, and placed online. These archives serve a dual purpose: preserving the past for future generations while also facilitating easy dissemination for today's audiences.

Digital preservation, however, is a topic of critical importance. Much of our early digital heritage has been lost. Many websites, especially those before the Internet Archive's 1996 web archiving project launch, are completely gone. Early storage media (from tape to early floppy disks) have become obsolete and are now nearly inaccessible. Compounding this, software development has seen early proprietary file formats deprecated, ignored, and eventually made wholly inaccessible.33 Fortunately, the problem of digital preservation was recognized by the 1990s, and in 2000 the American government established the National Digital Information Infrastructure and Preservation Program.34 As early as 1998, digital preservationists argued: "[h]istorians will look back on this era … and see a period of very little information. A 'digital gap' will span from the beginning of the wide-spread use of the computer until the time we eventually solve this problem. What we're all trying to do is to shorten that gap." To bear home their successes, that quotation is taken from an Internet Archive snapshot of Wired magazine's website.35

Indeed, greater attention towards rigorous metadata (which is, briefly put, data about data, often preserved in plain text format), open-access file formats documented by the International Organization for Standardization (ISO), and dedicated digital preservation librarians and archivists gives us hope for the future.36

It is my belief that cloud storage, the online hosting of data in large third-party offsite centres, mitigates the issue of obsolete storage media. Furthermore, the move towards open-source file formats, even amongst large commercial enterprises such as Microsoft, provides greater documentation and uniformity.


A Brief Case Study: Accessing and Navigating Canada's Digital Collections, With an Emphasis on Tools for Born-Digital Sources

What can we do about this digital deluge? There are no simple answers, but historians must begin to conceptualize new additions to their traditional research and pedagogical toolkits. In the section that follows, I will do two things: introduce a case study, and several tools, both proprietary and those that are open and available off-the-shelf, to reveal one way forward.

For a case study, I have elected to use Canada's Digital Collections (CDC), a collection of 'dead' websites removed from their original location on the World Wide Web and now digitally archived at LAC. In this section, I will introduce the collection using digitally-archived sources from the early World Wide Web and discuss my research methodology, before illustrating how we can derive meaningful data from massive arrays of unstructured data.37

This is a case study of a web archive. Several of these methodologies could be applied to other formats, including large arrays of textual material, whether born-digital or not. A distant reading of this type is not medium specific. It does, however, assume that the text is high quality. Digitized primary documents do not always have text layers, and if they are instituted computationally through Optical Character Recognition, errors can appear. Accuracy rates for digitized textual documents can range from as low as 40 percent for seventeenth- and eighteenth-century documents, to 80–90 percent for modern microfilmed newspapers, to over 99 percent for typeset, word-processed documents.38 That said, while historians today may be more interested in applying distant reading methodologies to more traditional primary source materials, it is my belief that in the not-so-distant future websites will be an increasingly significant source for social, cultural, and political historians.

What are Canada's Digital Collections? In 1996, Industry Canada, funded by the federal Youth Employment Strategy, facilitated the development of Canadian heritage websites by youth (defined as anyone between the ages of 15 and 30). It issued contracts to private corporations that agreed to use these individuals as web developers, for the twofold purpose of developing digital skills and instilling respect for and engagement with Canadian heritage and preservation.39 The collection, initially called SchoolNet but rebranded as Canada's Digital Collections in 1999, was accessible from the CDC homepage. After 2004, the program was shut down and the vast majority of the websites were archived on LAC's webpage. This collection is now the largest born-digital Internet archive that LAC holds, worthy of study as content as well as for thinking about methodological approaches.

Conventional records about the program are unavailable in the traditional LAC collection, unsurprising in light of both its recent creation and backlogs in processing and accessioning archival materials. We can use one element of our born-digital toolkit, the Internet Archive's WaybackMachine (its archive of preserved early webpages), to get a sense of what this project looked like in its heyday. The main webpage for the CDC was www.collections.ic.gc.ca. However, a visitor today will see nothing but a collection of broken image links (two of them will bring you to the English and French-language versions of the LAC-hosted web archive, although this is only discernible through trial and error or through the site's source code).

With the WaybackMachine, we have a powerful, albeit limited, opportunity to "go back in time" and experience the website as previous web viewers would have (if a preserved website is available, of course). Among the over 420 million archived websites, we can search for and find the original CDC site. Between 12 October 1999 and the most recent snapshot, or web crawl (of the now-broken site), on 6 July 2011, the site was crawled 410 times.40 Unfortunately, every site version until the 15 October 2000 crawl has been somehow corrupted. Content has been instead replaced by Chinese characters, which translate only into place names and other random words. Issues of early digital preservation, noted above, have reared their head in this collection.

From late 2000 onward, however, fully preserved CDC sites are accessible. Featured sites, new entries, success stories, awards, curricular units, and prominent invitations to apply for project funding populate the brightly coloured website. The excitement is palpable. John Manley, then Industry Minister, declared "it is crucial that Canada's young people get the skills and experience they need to participate in the knowledge-based economy of the future," and that CDC "shows that young people facing significant employment challenges can contribute to and learn from the resources available on the Information Highway."41 The documentation, reflecting the lower level of technical expertise and scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as well as preservation issues and hardware requirements for computers (a computer with an Intel 386 processor, 8MB RAM, a 1GB hard drive, and 28.8bps modem was recommended), and training information for employers, educators, and participants.42 Sample proposals, community liaison ideas, amidst other information, set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites is an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how viewers discovered the site and whether they think it is a good educational resource (they must have been relieved that the majority thought so: in 2003, some 84 percent felt that it was a "good" resource, the best of its type).43 By 2004, the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, alphabetical listing, success stories, and detailed copyright information.

The Canada's Digital Collections site began to move towards removal in November 2006, and was initially completely gone. A visitor to the site would learn that the content "is no longer available," and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007, a notice appeared with a declaration that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained, then and today for that matter, is a straightforward alphabetical list and a disclaimer that noted "[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost."

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use to navigate large amounts of data. First, I create my own tools through computer programming. This work is showcased in the following section. I program in Wolfram Research's Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base that complements exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website in perfectly readable plaintext (avoiding unicode errors), as follows: input=Import["http://activehistory.ca","Plaintext"], and a second can make it all lower case for textual analysis, as such: lowerpage=ToLowerCase[input]. Variables are a critical building block: in the former example, "input" now holds the entire plaintext of the website ActiveHistory.ca, and "lowerpage" then holds it in lowercase. Once we have lower-cased information, we are quickly able to extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
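To make this concrete, a minimal sketch of the workflow just described might look as follows. This is an illustrative reconstruction rather than the code used in the study, and the short stop-word list is a placeholder:

  input = Import["http://activehistory.ca", "Plaintext"];
  lowerpage = ToLowerCase[input];
  words = StringSplit[lowerpage, Except[WordCharacter] ..];  (* tokenize into words *)
  stopwords = {"the", "and", "to", "of", "a", "in", "is", "for"};  (* illustrative only *)
  kept = Select[words, ! MemberQ[stopwords, #] &];
  wordFreq = Reverse[SortBy[Tally[kept], Last]];  (* most frequent words first *)
  bigramFreq = Reverse[SortBy[Tally[Partition[words, 2, 1]], Last]];  (* frequent two-word phrases *)
  Take[wordFreq, 20]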

Even more promising is Mathematicarsquos integration with theWolfram|Alpha database (itself accessible at wolframalphacom) acomputational knowledge engine that promises to put incredibleamounts of information at a userrsquos fingertips With a quick com-mand one can access the temperature at Torontorsquos Lester B PearsonInternational Airport at 500 pm on 18 December 1983 the rela-

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

tive popularity of names such as ldquoEdithrdquo or ldquoEmmardquo the Canadianunemployment rate in March 1985 or the Canadian GrossDomestic Product in December 199546 This allows one to quicklycross-reference findings with statistical information or find coinci-dental points of comparison As universities are now being activelyencouraged to contribute their information to Wolfram|Alpha thiswill become an increasingly powerful database
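A sketch of such a query, assuming a Mathematica session with internet access; the free-form query strings here are illustrative, and Wolfram|Alpha may interpret them differently:

  WolframAlpha["Canadian unemployment rate March 1985"]
  WolframAlpha["temperature Toronto Pearson Airport 5pm 18 December 1983"]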

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.
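For instance, a finding such as the word-frequency list computed above could, in principle, be wrapped in an interactive control and deployed to a CDF file. This is a sketch only; the file name is hypothetical, and wordFreq is the list from the earlier example:

  CDFDeploy["cdc-word-frequencies.cdf",
    Manipulate[
      BarChart[Take[wordFreq, n][[All, 2]],
        ChartLabels -> Take[wordFreq, n][[All, 1]]],
      {n, 5, 50, 1}]]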

Programming is not the be-all and end-all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools.47 It is a powerful suite of textual analysis tools. Uploading one's own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a "word cloud" of a corpus; click on any word to see its relative rise and fall over time (word trends); access the original text; and click on any word to see its contextual placement within individual sentences. It ably facilitates the "distance" reading of any reasonably-sized body of textual information and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. There are many visualization options provided: you can plot relationships between data, compare values (i.e. bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or, in a best-case scenario, hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" which appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2, or on my own blog at ianmilligan.ca. I have elected to leave them online rather than provide commands in this article: they require some technical knowledge, and working directly from the web is quicker than working from here. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian dealing with conventional sources often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command-line entry brought the entire archive onto my system; it took days to process the information, and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations. The information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, 360 MB of plaintext with HTML encoding, reduced to a 'mere' 50MB if all encoding is removed. A 50MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.
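Figures of this kind can be computed directly once the archive is mirrored locally. A sketch, assuming the HTML has already been reduced to plaintext files in a hypothetical cdc-plaintext directory:

  files = FileNames["*.txt", "cdc-plaintext", Infinity];  (* every plaintext file, recursively *)
  counts = Map[Length[StringSplit[Import[#, "Text"]]] &, files];
  {Length[files], Total[counts], Total[Map[FileByteCount, files]]}  (* files, words, bytes *)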

How can we access this information? Several different methods are possible. First, we can work with the information on the internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with material. Before you do this, though, you need to be familiar with the website's "terms of service," or the legal documentation that outlines how users are allowed to use a website, resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced in part or in whole and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so: the index is located at www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/AB/benevolat, the next in …cdc/ab/nature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system, in this case one that noted that you had seen the disclaimer about the archival nature of the website), and then, with a one-line command in my command line, to download the entire database.52 Due to the imposition of a random wait and bandwidth limit, which meant that I would download one file every half to one and a half seconds at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these connections must often be inferred. Online, HTML born-digital sources are different in that the link offers a very concrete file.

Visualizing the CDC Collection as a Series of Hyperlinks

One quick visualization, the circular cluster seen here, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As we see in the figure, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The site is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and the relationships between different sources. At a glimpse, we can see suggestions to follow other sites, and determine if any nodes are shared amongst the various websites revealed.
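The underlying idea can be sketched in a few lines: harvest the outbound links of each page, and plot them as edges. This is an illustrative reconstruction rather than the model itself, and it follows links only one level out from the index page:

  pageLinks[url_] := Import[url, "Hyperlinks"];  (* scrape every link on a page *)
  seed = "http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp";
  edges = Map[seed -> # &, pageLinks[seed]];
  GraphPlot[edges, VertexLabeling -> Tooltip]  (* hover to see each URL *)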

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian who is primarily interested in the written word, my first step was to isolate the plaintext throughout the material so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels: first, the aggregate collection, and second, dividing the collection into each 'file' or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words within the nearly eight million words of the collection appear? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader, and is of minimal visual attractiveness. Two quick decisions were made to enhance data usability. First, the information was "case-folded," all sent to lower-case; this enables us to count the total appearances of a word such as "digital" regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency, as opposed to phrase frequency, "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is comprised entirely of stop words), but will quickly dominate any word-level visualization or list.53 Other words may need to be added or removed from stop lists depending on the corpus: for example, if the phrase "page number" appeared on every page, that would be useless information for the researcher.

The first step, then, is to use Mathematica to format word frequency information into Wordle.net-readable format (if we put all of the text into that website, it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds. What can we learn from this illustration?

A Word Cloud of All Words in the CDC Corpus

First, without knowing anything about the sites, we can immediately learn that it is dealing with Canada/Canadians, that it is a collection with documents in French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than both "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, it leaves a severe amount of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but need to be used with extreme caution. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present." Despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters (such as 'bi'), two syllables, or, pertinent in our case, two words; an example is "canada is." A trigram is three words, a quadgram is four words, a fivegram is five words, and so on.
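Extracting such n-grams is nearly a one-line affair once a file has been tokenized. A sketch, assuming words holds the tokenized text of a site, produced as in the earlier example:

  ngrams[tokens_, n_] := Reverse[SortBy[Tally[Partition[tokens, n, 1]], Last]];
  Take[ngrams[words, 3], 15]  (* the fifteen most frequent trigrams *)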


Word Cloud of the File "Past and Present"

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

{"alberta", 758}, {"listen", 490}, {"settlers", 467}, {"new", 454}, {"canada", 388}, {"land", 362}, {"read", 328}, {"people", 295}, {"settlement", 276}, {"chinese", 268}, {"canadian", 264}, {"place", 255}, {"heritage", 254}, {"edmonton", 250}, {"years", 238}, {"west", 236}, {"raymond", 218}, {"ukrainian", 216}, {"came", 213}, {"calgary", 209}, {"names", 206}, {"ranch", 204}, {"updated", 193}, {"history", 190}, {"communities", 183}, …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example in this automatically generated output of top phrases:

{{"first", "people", "settlers"}, 100}, {{"the", "united", "states"}, 72}, {{"one", "of", "the"}, 59}, {{"of", "the", "west"}, 58}, {{"settlers", "last", "updated"}, 56}, {{"people", "settlers", "last"}, 56}, {{"albertans", "first", "people"}, 56}, {{"adventurous", "albertans", "first"}, 56}, {{"the", "bar", "u"}, 55}, {{"opening", "of", "the"}, 54}, {{"communities", "adventurous", "albertans"}, 54}, {{"new", "communities", "adventurous"}, 54}, {{"bar", "u", "ranch"}, 53}, {{"listen", "to", "the"}, 52}, …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.
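Such a viewer can be approximated with Manipulate, again assuming the hypothetical ngrams helper and words token list from the sketches above:

  Manipulate[
    Grid[Take[ngrams[words, n], 15], Alignment -> Left],
    {{n, 3, "n-gram length"}, 2, 4, 1}]  (* slider toggles bigrams through quadgrams *)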

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stopword" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

{{"experience", "in", "the"}, 919}, {{"transferred", "to", "the"}, 918}, {{"funded", "by", "the"}, 918}, {{"of", "the", "provincial"}, 915}, {{"this", "web", "site"}, 915}, {{"409", "registry", "2"}, 914}, {{"are", "no", "longer"}, 914}, {{"gouvernement", "du", "canada"}, 910}, {{"use", "this", "link"}, 887}, {{"page", "or", "you"}, 887}, {{"following", "page", "or"}, 887}, {{"automatically", "transferred", "to"}, 887}, {{"be", "automatically", "transferred"}, 887}, {{"will", "be", "automatically"}, 887}, {{"industry", "you", "will"}, 887}, {{"multimedia", "industry", "you"}, 887}, {{"practical", "experience", "in"}, 887}, {{"gained", "practical", "experience"}, 887}, {{"they", "gained", "practical"}, 887}, {{"while", "they", "gained"}, 887}, {{"canadians", "while", "they"}, 887}, {{"young", "canadians", "while"}, 887}, {{"showcase", "of", "work"}, 887}, {{"excellent", "showcase", "of"}, 887}, …

Here we see an array of common language: funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we can move into the unique or very rare phrases that appear only once, twice, or three times.

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time. … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or MALLET itself, via a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable, and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works. It is a proprietary black box: while the original algorithm is available, the actual methods that go into determining the ranking of search terms are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
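A sketch of the first prong, assuming the same hypothetical cdc-plaintext mirror used earlier; each term's counts across the files are then drawn as grouped bars:

  terms = {"freedom", "black", "liberty"};
  files = FileNames["*.txt", "cdc-plaintext", Infinity];
  tokenize[file_] := StringSplit[ToLowerCase[Import[file, "Text"]], Except[WordCharacter] ..];
  counts = Table[Count[tokenize[f], t], {t, terms}, {f, files}];
  BarChart[counts, ChartLegends -> terms]  (* one bar group per search term *)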

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.
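A keyword-in-context routine of this kind can be sketched in a few lines; the window size is illustrative, and it reuses the hypothetical tokenize helper from the previous sketch:

  kwic[tokens_, term_, window_: 5] :=
    Map[StringJoin[Riffle[tokens[[Max[# - window, 1] ;; Min[# + window, Length[tokens]]]], " "]] &,
      Flatten[Position[tokens, term]]];
  kwic[tokenize["blackgold.txt"], "ontario"]  (* each hit with five words of context *)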


A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly realize critical sources.

Another model through which we can visualize documents is through a viewer primarily focusing on term document frequency and inverse frequency terms, discussed below. To prepare this material, we combine our earlier data about word frequency (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e. the word "experience" appears 1,766 times). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf times the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."
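In code, the weighting reduces to a single line; the document counts in the example call are hypothetical:

  tfidf[tf_, totalDocs_, docsWithTerm_] := tf*Log[totalDocs/docsWithTerm];
  N[tfidf[919, 384, 50]]  (* a term appearing 919 times, in 50 of 384 sites *)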


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance rather than processed on-site, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need rears itself, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin or jumpstart a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. These should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distance reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web, whether through blogging, micro-blogging, publishing webpages, and so forth; it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history and the readiness to work with born-digital sources cannot stem from pedagogy alone, however. Institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary, and second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is the founder and co-editor of the website ActiveHistory.ca, as well as an editor of the Programming Historian 2. He has written several articles on Canadian youth, digitization, and labour history.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996: Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline/44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301; also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.


14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients – 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to learn the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp: Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook, Penguin, 2010. For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online, www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes, a megabyte (MB) a million, a gigabyte (GB) a billion, a terabyte (TB) a trillion, and a petabyte (PB) a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can, of course, vary depending on compression methods chosen. Their math was based on an average of 8MB per book, if it was scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82. Published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>; and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>; and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19; also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Centre for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Centre for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33. Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34. For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35. Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive: www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36. Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37. Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38. See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39. Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40. "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online: wayback.archive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41. "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online: www.web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42. Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collection Webpage, 24 November 2000, through Internet Archive: www.web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43. "Canada's Digital Collections homepage," 21 November 2003, through Internet Archive: www.web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44. "Important news about Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive: www.web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45. For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46. It was 8 degrees Celsius at 5:00 pm on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and "Emma" is now the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47. It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.


48. For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49. The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50. For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51. "Important Notices," Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. These can usually be found on index or splash pages under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52. The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53. For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54. David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55. The research was actually carried out on a rather dated late-2007 MacBook Pro.

56. See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>; and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57. "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/myabout/history <viewed 18 December 2012>.

58. "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59. Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.


A Brief Case Study: Accessing and Navigating Canada's Digital Collections, with an Emphasis on Tools for Born-Digital Sources

What can we do about this digital deluge? There are no simple answers, but historians must begin to conceptualize new additions to their traditional research and pedagogical toolkits. In the section that follows, I will do two things: introduce a case study and several tools, both proprietary and those that are open and available off-the-shelf, to reveal one way forward.

For a case study, I have elected to use Canada's Digital Collections (CDC), a collection of 'dead' websites removed from their original location on the World Wide Web and now digitally archived at LAC. In this section, I will introduce the collection using digitally-archived sources from the early World Wide Web and discuss my research methodology, before illustrating how we can derive meaningful data from massive arrays of unstructured data.37

This is a case study of a web archive. Several of these methodologies could be applied to other formats, including large arrays of textual material, whether born-digital or not. A distant reading of this type is not medium specific. It does, however, assume that the text is high quality. Digitized primary documents do not always have text layers, and when these are created computationally through Optical Character Recognition, errors can appear. Accuracy rates for digitized textual documents can range from as low as 40 percent for seventeenth- and eighteenth-century documents, to 80–90 percent for modern microfilmed newspapers, to over 99 percent for typeset, word-processed documents.38 That said, while historians today may be more interested in applying distant reading methodologies to more traditional primary source materials, it is my belief that in the not-so-distant future websites will be an increasingly significant source for social, cultural, and political historians.

What are Canada's Digital Collections? In 1996, Industry Canada, funded by the federal Youth Employment Strategy, facilitated the development of Canadian heritage websites by youth (defined as anyone between the ages of 15 and 30). It issued contracts to private corporations that agreed to use these individuals as web developers, for the twofold purpose of developing digital skills and instilling respect for and engagement with Canadian heritage and preservation.39 The collection, initially called SchoolNet but rebranded as Canada's Digital Collections in 1999, was accessible from the CDC homepage. After 2004, the program was shut down and the vast majority of the websites were archived on LAC's webpage. This collection is now the largest born-digital Internet archive that LAC holds, worthy of study as content as well as for thinking about methodological approaches.

Conventional records about the program are unavailable in the traditional LAC collection, unsurprising in light of both its recent creation and backlogs in processing and accessioning archival materials. We can use one element of our born-digital toolkit, the Internet Archive's WaybackMachine (its archive of preserved early webpages), to get a sense of what this project looked like in its heyday. The main webpage for the CDC was www.collections.ic.gc.ca. However, a visitor today will see nothing but a collection of broken image links (two of them will bring you to the English and French-language versions of the LAC-hosted web archive, although this is only discernible through trial and error or through the site's source code).

With the WaybackMachine, we have a powerful, albeit limited, opportunity to "go back in time" and experience the website as previous web viewers would have (if a preserved website is available, of course). Among the over 420 million archived websites, we can search for and find the original CDC site. Between 12 October 1999 and the most recent snapshot, or web crawl, of the now-broken site on 6 July 2011, the site was crawled 410 times.40 Unfortunately, every site version until the 15 October 2000 crawl has been somehow corrupted. Content has instead been replaced by Chinese characters, which translate only into place names and other random words. The issues of early digital preservation noted above have reared their head in this collection.

From late 2000 onward, however, fully preserved CDC sites are accessible. Featured sites, new entries, success stories, awards, curricular units, and prominent invitations to apply for project funding populate the brightly coloured website. The excitement is palpable. John Manley, then Industry Minister, declared that "it is crucial that Canada's young people get the skills and experience they need to participate in the knowledge-based economy of the future," and that the CDC "shows that young people facing significant employment challenges can contribute to and learn from the resources available on the Information Highway."41 The documentation, reflecting the lower level of technical expertise and scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as were preservation issues, hardware requirements (a computer with an Intel 386 processor, 8MB RAM, a 1GB hard drive, and a 28.8 kbps modem was recommended), and training information for employers, educators, and participants.42 Sample proposals, community liaison ideas, and other information set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites offers an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how visitors discovered the site and whether they thought it a good educational resource (the administrators must have been relieved that the majority did; in 2003, some 84 percent felt that it was a "good" resource, the best of its type).43 By 2004, the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, alphabetical listing, success stories, and detailed copyright information.

The Canada's Digital Collections site began to move towards removal in November 2006, when the content initially disappeared entirely. A visitor to the site would learn that the content "is no longer available," and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007, a notice appeared declaring that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained, then as today, is a straightforward alphabetical list and a disclaimer noting: "[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost."

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use below. First, I create my own tools through computer programming; this work is showcased in the following section. I program in Wolfram Research's Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base that complements exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website in perfectly readable plaintext (avoiding unicode errors), as follows: input = Import["http://activehistory.ca/", "Plaintext"], and a second can convert it all to lower case for textual analysis: lowerpage = ToLowerCase[input]. Variables are a critical building block: in the former example, "input" now holds the entire plaintext of the website ActiveHistory.ca, and "lowerpage" then holds it in lower case. Once we have lower-cased information, we can quickly extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, build word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
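To make this concrete, here is a minimal sketch of the frequent-word step; the word-splitting pattern is an illustrative choice, not the exact routine used for the CDC analysis below:

    input = Import["http://activehistory.ca/", "Plaintext"];
    lowerpage = ToLowerCase[input];
    (* split the text into words on runs of non-word characters *)
    words = StringSplit[lowerpage, Except[WordCharacter] ..];
    (* tally each word and sort in descending order of frequency *)
    wordcounts = Reverse[SortBy[Tally[words], Last]];
    TableForm[Take[wordcounts, 20]]

The last line displays the twenty most frequent words with their counts: the raw material for the frequency lists and word clouds discussed below.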

Even more promising is Mathematica's integration with the Wolfram|Alpha database (itself accessible at wolframalpha.com), a computational knowledge engine that promises to put incredible amounts of information at a user's fingertips. With a quick command, one can access the temperature at Toronto's Lester B. Pearson International Airport at 5:00 pm on 18 December 1983, the relative popularity of names such as "Edith" or "Emma," the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information or find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.
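Such lookups can be made directly within a notebook through the built-in WolframAlpha function; a minimal sketch, with an illustrative query string:

    (* send a free-form query to the Wolfram|Alpha knowledge engine *)
    WolframAlpha["Canadian unemployment rate in March 1985"]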

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be-all and end-all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools.47 It is a powerful suite of textual analysis tools. Uploading one's own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a "word cloud" of a corpus; word trends, where one can click on any word to see its relative rise and fall over time; and access to the original text, where clicking on any word reveals its contextual placement within individual sentences. It ably facilitates the distant reading of any reasonably-sized body of textual information and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. Many visualization options are provided: you can plot relationships between data, compare values (i.e., bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as to IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or, in a best-case scenario, hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" that appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2 or on my own blog at ianmilligan.ca. I have elected to leave the detailed commands online rather than provide them in this article: they require some technical knowledge, and working directly from the web is quicker than working from print. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian dealing with conventional sources often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command-line entry brought the entire archive onto my system; it took days to process the information, and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations. The information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, with 360 MB of plaintext with HTML encoding, reduced to a 'mere' 50 MB if all encoding is removed. A 50 MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the Internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with the material. Before you do this, though, you need to be familiar with the website's "terms of service," the legal documentation that outlines how users are allowed to use a website, its resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced in part or in whole and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so: the index is located at epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website is hosted in .../ic/cdc/ABbenevolat, the next in .../cdc/abnature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system; in this case, one noting that you had seen the disclaimer about the archival nature of the website), and then, with a one-line command, to download the entire database.52 Due to the imposition of a random wait and a bandwidth limit, which meant that I would download one file every half to one and a half seconds at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.
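For reference, the full command, reproduced from note 52 with its punctuation restored, took the following shape:

    # mirror the site, presenting the saved disclaimer cookie, with a
    # random wait and a 30 KB/s bandwidth cap to be a good netizen
    wget -m --cookies=on --keep-session-cookies \
         --load-cookies=cookie.txt \
         --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm \
         --random-wait --wait=1 --limit-rate=30K \
         http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp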

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these connections must often be inferred; online, HTML born-digital sources are different in that the link offers a very concrete file.

[Figure: Visualizing the CDC Collection as a Series of Hyperlinks]

One quick visualization, the circular cluster seen in the figure, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. The CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn which link it is, and thus see where the collection fits into the broader Internet. The model is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and the relationships between different sources. At a glimpse, we can see suggestions of other sites to follow and determine whether any nodes are shared amongst the various websites revealed.
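A sketch of the underlying approach, assuming we start from the archive's index page ("Hyperlinks" is a built-in element of Mathematica's Import; the published model also followed links a further level deep and added mouse-over labels):

    (* scrape every link from the seed page *)
    seed = "http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp";
    links = Import[seed, "Hyperlinks"];
    (* draw each scraped link as a directed edge from the seed *)
    Graph[Map[(seed -> #) &, links], VertexLabels -> Placed["Name", Tooltip]]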

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian primarily interested in the written word, my first step was to isolate the plaintext throughout the material so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels: first, the aggregate collection, and second, dividing the collection into each 'file,' or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words within the nearly eight million words of the collection appear? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader and is of minimal visual attractiveness. Two quick decisions were made to enhance data usability. First, the information was "case-folded," all sent to lower case; this enables us to count the total appearances of a word such as "digital" regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency, as opposed to phrase frequency, "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is comprised entirely of stop words) but will quickly dominate any word-level visualization or list.53 Other words may need to be added to or removed from stop lists depending on the corpus: if the phrase "page number" appeared on every page, for example, that would be useless information for the researcher.
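In code, this is a simple filter; a minimal sketch with a toy stop list (a working list, particularly for a bilingual corpus such as this one, runs to hundreds of English and French entries):

    stopwords = {"the", "and", "to", "of", "a", "in", "is", "for", "de", "la", "le"};
    (* keep only those words that do not appear in the stop list *)
    filtered = Select[words, ! MemberQ[stopwords, #] &];
    Reverse[SortBy[Tally[filtered], Last]]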

The first step, then, is to use Mathematica to format word-frequency information into a Wordle.net-readable format (if we put all of the text into that website it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds.

[Figure: A Word Cloud of All Words in the CDC Corpus]

What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that the collection deals with Canada and Canadians, that it holds documents in both French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," although "québec"/"quebec" is not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, they leave a great deal of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or to educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but need to be used with caution. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present." Despite the ambiguous title, we can quickly learn from this visualization what the website contains. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters (such as "bi"), two syllables, or, pertinent in our case, two words: "canada is," for example. A trigram is three words, a quadgram four, a fivegram five, and so on; a sketch of how these can be generated follows below.
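Computationally, word-level n-grams are simply overlapping windows over the token list, which makes them straightforward to generate; a sketch, reusing the words list from the earlier example (stop words are retained here, since they matter for phrases):

    (* Partition with offset 1 yields every overlapping n-word window *)
    ngrams[tokens_, n_] := Partition[tokens, n, 1];
    trigrams = ngrams[words, 3];
    (* the ten most frequent three-word phrases *)
    Take[Reverse[SortBy[Tally[trigrams], Last]], 10]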


[Figure: Word Cloud of the File "Past and Present"]

[Figure: A viewer for n-grams in the CDC Collection]

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

"alberta" 758, "listen" 490, "settlers" 467, "new" 454, "canada" 388, "land" 362, "read" 328, "people" 295, "settlement" 276, "chinese" 268, "canadian" 264, "place" 255, "heritage" 254, "edmonton" 250, "years" 238, "west" 236, "raymond" 218, "ukrainian" 216, "came" 213, "calgary" 209, "names" 206, "ranch" 204, "updated" 193, "history" 190, "communities" 183 …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, as in this automatically generated output of top phrases:

"first people settlers" 100, "the united states" 72, "one of the" 59, "of the west" 58, "settlers last updated" 56, "people settlers last" 56, "albertans first people" 56, "adventurous albertans first" 56, "the bar u" 55, "opening of the" 54, "communities adventurous albertans" 54, "new communities adventurous" 54, "bar u ranch" 53, "listen to the" 52 …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, toggling between bigrams, trigrams, and quadgrams, allowing you to hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457), or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stop-word" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

"experience in the" 919, "transferred to the" 918, "funded by the" 918, "of the provincial" 915, "this web site" 915, "409 registry 2" 914, "are no longer" 914, "gouvernement du canada" 910, "use this link" 887, "page or you" 887, "following page or" 887, "automatically transferred to" 887, "be automatically transferred" 887, "will be automatically" 887, "industry you will" 887, "multimedia industry you" 887, "practical experience in" 887, "gained practical experience" 887, "they gained practical" 887, "while they gained" 887, "canadians while they" 887, "young canadians while" 887, "showcase of work" 887, "excellent showcase of" 887 …

Here we see an array of common language: funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we can move into the unique or very rare phrases that appear only once, twice, or three times.

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a computer science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or MALLET itself, via a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable, and come in two formats: a word-cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.
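For readers who prefer the command line to the Meandre workbench, the equivalent MALLET run looks roughly like the following; the directory and file names are illustrative:

    # import a directory of plaintext files, one document per file
    bin/mallet import-dir --input cdc-plaintext/ --output cdc.mallet \
        --keep-sequence --remove-stopwords
    # train a twenty-topic model; write the top words for each topic
    bin/mallet train-topics --input cdc.mallet --num-topics 20 \
        --output-topic-keys cdc-topic-keys.txt --output-doc-topics cdc-doc-topics.txt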


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes); the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language); the stories (story, knight, archives, but also, interestingly, knight and wife); and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works. It is a proprietary black box: while the original algorithm is available, the actual methods that determine the ranking of search results are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word-frequency files for occurrences of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. In the graph above, for example, we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.

We can then move down into the data. In the example above, file "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.


[Figure: A snapshot of the dynamic finding aid tool]
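A keyword-in-context routine of this kind takes only a few lines; a sketch (the forty-character window is an arbitrary choice, and the file name follows the collection's plaintext naming):

    (* return each occurrence of a term with ~40 characters of context *)
    kwic[text_String, term_String, pad_Integer : 40] :=
      (StringTake[text,
          {Max[First[#] - pad, 1],
           Min[Last[#] + pad, StringLength[text]]}] &) /@
       StringPosition[ToLowerCase[text], ToLowerCase[term]];
    Column[kwic[Import["blackgold.txt", "Text"], "ontario"]]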

We see that this website focuses mostly on southwestern Ontario, where many oil fields are located. If we move forward, we can see that it pertains to the nineteenth century, to a particular region, and to subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into the individual websites. Broad contours emerge from this distant reading, but it also facilitates deeper reading by allowing us to promptly identify critical sources.

Another model through which we can visualize documents is a viewer focusing primarily on term frequency and inverse document frequency, discussed below. To prepare this material, we combine our earlier data about word frequency (that the word "experience" appears 919 times in a given site, perhaps) with aggregate information about frequency across the entire corpus (i.e., that the word "experience" appears 1,766 times overall). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, with which we can learn which words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf multiplied by the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."
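In symbols, for a term t in document d, tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents in the collection and df(t) is the number of documents containing t. A sketch, with illustrative numbers rather than actual CDC results:

    (* term frequency times log of (total docs / docs containing term) *)
    tfidf[tf_, df_, ndocs_] := tf*Log[ndocs/df];
    (* e.g., a word appearing 919 times in a site, found in 80 of 384 sites *)
    N[tfidf[919, 80, 384]]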


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet still distant reading of individual websites as well, especially the suites of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires one to put individual data into each separate program rather than quickly navigating between them. This stems from the massive number of files and amount of data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted to queues in advance rather than processed on demand, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need rears itself, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin, or jumpstart, a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information contained across over 73,000 files (as in the CDC collection) can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continuing education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. Such training need not be mandatory, but could instead be taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and of attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include familiarity with citation management software and databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital repositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of the various ways to openly disseminate material on the web, whether through blogging, micro-blogging, or publishing webpages; it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history and the readiness to work with born-digital sources cannot stem from pedagogy alone, however; institutional support is necessary. While the evidence is anecdotal, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessment committees, leading to fears that their projects might be lost between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for instance, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article has argued, first, why digital methodologies are necessary and, second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1. The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996: Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2. A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3. See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4. These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5. Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6. This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7. Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8. While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and the structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9. A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10. Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11. Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12. Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information see the Canadian Century Research InfrastructureProject website at wwwccriuottawacaCCRIHomehtml and the Programme de recherche en deacutemographie historique at wwwgeneologyumontrealcafrleprdhhtm

14. "With Criminal Intent," CriminalIntent.org, criminalintent.org <viewed 20 December 2012>.

15. For an overview, see "Award Recipients – 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16. One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to teach the basics of computational history. See The Programming Historian 2, 2012–present, programminghistorian.org <viewed 11 January 2013>.

17. Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18. Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19. Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp: Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20. The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook (Penguin, 2010). For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21. Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online: mashable.com/2010/03/09/cisco-crs-3.

22. A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23. These figures can, of course, vary depending on the compression methods chosen. Their math was based on an average of 8MB per book, if it was scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24. Gleick, 232.

25. "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26. "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27. Jean-Baptiste Michel et al., 176–82; published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28. "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29. For background, see Bill Curry, "Visiting Library and Archives in Ottawa: Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30. Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31. "Save LAC: May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32. The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media: chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Centre for History and New Media and American Social History Project/Center for Media and Learning, 911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Centre for History and New Media, occupyarchive.org <viewed 11 January 2013>.

33. Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34. For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35. Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive: web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36. Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37. Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38. See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in ProQuest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39. Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40. "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online: wayback.archive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41. "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online: web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42. Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collection Webpage, 24 November 2000, through Internet Archive: web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43. "Canada's Digital Collections homepage," 21 November 2003, through Internet Archive: web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44. "Important news about Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive: web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45. For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46. It was 8 degrees Celsius at 5:00 pm on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and "Emma" is now the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47. It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at voyant-tools.org.

48. For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49. The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50. For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51. "Important Notices," Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. They can usually be found on index or splash pages, under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52. The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and bandwidth cap, to be a good netizen.

53. For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at nlp.stanford.edu/IR-book.

54. David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55. The research was actually carried out on a rather dated, late-2007 MacBook Pro.

56. See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57. "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58. "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59. Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.


developers for the twofold purpose of developing digital skills and instilling respect for, and engagement with, Canadian heritage and preservation.39 The collection, initially called SchoolNet but rebranded as Canada's Digital Collections in 1999, was accessible from the CDC homepage. After 2004, the program was shut down, and the vast majority of the websites were archived on LAC's webpage. This collection is now the largest born-digital Internet archive that LAC holds, worthy of study both for its content and for thinking through methodological approaches.

Conventional records about the program are unavailable in the traditional LAC collection, unsurprising in light of both its recent creation and backlogs in processing and accessioning archival materials. We can use one element of our born-digital toolkit, the Internet Archive's WaybackMachine (its archive of preserved early webpages), to get a sense of what this project looked like in its heyday. The main webpage for the CDC was www.collections.ic.gc.ca. However, a visitor today will see nothing but a collection of broken image links (two of them will bring you to the English- and French-language versions of the LAC-hosted web archive, although this is only discernible through trial and error, or through the site's source code).

With the WaybackMachine, we have a powerful, albeit limited, opportunity to "go back in time" and experience the website as previous web viewers would have (if a preserved website is available, of course). Among the over 420 million archived websites, we can search for and find the original CDC site. Between 12 October 1999 and the most recent snapshot, or web crawl, of the now-broken site on 6 July 2011, the site was crawled 410 times.40 Unfortunately, every site version until the 15 October 2000 crawl has been somehow corrupted. Content has instead been replaced by Chinese characters, which translate only into place names and other random words. The issues of early digital preservation noted above have reared their head in this collection.

From late 2000 onward, however, fully preserved CDC sites are accessible. Featured sites, new entries, success stories, awards, curricular units, and prominent invitations to apply for project funding populate the brightly coloured website. The excitement is palpable. John Manley, then Industry Minister, declared that "it is crucial that Canada's young people get the skills and experience they need to participate in the knowledge-based economy of the future," and that CDC "shows that young people facing significant employment challenges can contribute to and learn from the resources available on the Information Highway."41 The documentation, reflecting the lower level of technical expertise and scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as well as preservation issues, hardware requirements for computers (a computer with an Intel 386 processor, 8MB RAM, a 1GB hard drive, and a 28.8 kbps modem was recommended), and training information for employers, educators, and participants.42 Sample proposals and community liaison ideas, amidst other information, set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites offers an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how viewers discovered the site and whether they think it is a good educational resource (they must have been relieved that the majority thought so: in 2003, some 84 percent felt that it was a "good" resource, the best of its type).43 By 2004, the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, alphabetical listing, success stories, and detailed copyright information.

The Canada's Digital Collections site began to move towards removal in November 2006, and was initially completely gone. A visitor to the site would learn that the content "is no longer available," and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007, a notice appeared with a declaration that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained, then and today for that matter, is a straightforward alphabetical list and a disclaimer that noted: "[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost."

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information. First, I create my own tools through computer programming; this work is showcased in the following section. I program in Wolfram Research's Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base that complements exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website in perfectly readable plaintext (avoiding unicode errors), as follows: input = Import["http://activehistory.ca/", "Plaintext"], and a second can make it all lower case for textual analysis, as such: lowerpage = ToLowerCase[input]. Variables are a critical building block. In the former example, "input" now holds the entire plaintext of the website ActiveHistory.ca, and "lowerpage" then holds it in lower case. Once we have lower-cased information, we are quickly able to extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
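
A minimal sketch of such a session, stringing those two lines together with a simple word count, might look like this (the splitting and ranking choices here are illustrative rather than prescriptive):

    input = Import["http://activehistory.ca/", "Plaintext"];
    lowerpage = ToLowerCase[input];
    (* split on runs of non-word characters to get a list of words *)
    words = DeleteCases[StringSplit[lowerpage, Except[WordCharacter] ..], ""];
    (* tally the words and rank them by frequency, most common first *)
    frequencies = Reverse[SortBy[Tally[words], Last]];
    Take[frequencies, 25]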

Even more promising is Mathematica's integration with the Wolfram|Alpha database (itself accessible at wolframalpha.com), a computational knowledge engine that promises to put incredible amounts of information at a user's fingertips. With a quick command, one can access the temperature at Toronto's Lester B. Pearson International Airport at 5:00 pm on 18 December 1983, the relative popularity of names such as "Edith" or "Emma," the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information, or find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.
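
Within a notebook, such look-ups are one-line free-form queries; a sketch (the exact query phrasings here are mine, and may need adjusting to return the desired pod):

    WolframAlpha["Canadian unemployment rate March 1985"]
    WolframAlpha["temperature Toronto Pearson Airport 5pm 18 December 1983"]
    WolframAlpha["popularity of the name Emma"]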

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be all and end all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools.47 It is a powerful suite of textual analysis tools. Uploading one's own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a "word cloud" of a corpus; the relative rise and fall of any word over time (word trends); access to the original text; and, on clicking any word, its contextual placement within individual sentences. It ably facilitates the "distant" reading of any reasonably-sized body of textual information, and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. Many visualization options are provided: you can plot relationships between data, compare values (i.e. bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or, in a best-case scenario, hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow, mentioned in this article, is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" that appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2, or on my own blog at ianmilligan.ca. I have elected to leave the full walkthroughs online rather than provide every command in this article: they require some technical knowledge, and working directly from the web is quicker. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian, dealing with conventional sources, often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command-line entry brought the entire archive onto my system; it took days to process the information, and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay, or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations. The information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, 360 MB of plaintext with HTML encoding, reduced to a 'mere' 50MB if all encoding is removed. A 50MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible: we can work with the information on the internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with the material. Before you do this, though, you need to be familiar with the website's "terms of service," the legal documentation that outlines how users are allowed to use a website, its resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced in part or in whole and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so: the index is located at www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/ABbenevolat, the next in …cdc/abnature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system, in this case one that noted that you had seen the disclaimer about the archival nature of the website), and then, with a one-line command in my command line, downloaded the entire database.52 Due to the imposition of a random wait and bandwidth limit, which meant that I would download one file every half to one and a half seconds, at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.
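
Note 52 records the exact command used for this download; stripped of its cookie and referrer handling, its general shape is something like:

    wget -m --random-wait --wait=1 --limit-rate=30K \
         http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp

Here -m "mirrors" the site recursively, while --random-wait, --wait, and --limit-rate throttle the requests so as not to hammer the server.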

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these connections must often be inferred; online, HTML born-digital sources are different in that the link offers a very concrete file.

Visualizing the CDC Collection as a Series of Hyperlinks

One quick visualization, the circular cluster seen above, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As the figure shows, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The crawl is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and the relationships between different sources. At a glimpse, we can see suggestions to follow other sites, and determine whether any nodes are shared amongst the various websites revealed.
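
A sketch of how such a link graph can be scraped and drawn (the seed URL and the twenty-page cap are illustrative; the actual model followed links two levels beyond the first page):

    (* scrape the hyperlinks from a page *)
    links[url_] := Union[Import[url, "Hyperlinks"]];
    seeds = Take[links["http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp"], 20];
    (* build edges from each seed page to the pages it links to *)
    edges = Flatten[Map[Function[u, Map[(u -> #) &, links[u]]], seeds]];
    GraphPlot[edges, VertexLabeling -> Tooltip]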

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian primarily interested in the written word, my first step was to isolate the plaintext throughout the material, so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels, first the aggregate collection and second the collection divided into each 'file,' or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes, and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words appear within the nearly eight million words of the collection? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader, and it is of minimal visual attractiveness. Two quick decisions were made to enhance data usability: first, the information was "case-folded," all sent to lower-case; this enables us to count the total appearances of a word such as "digital" regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency, as opposed to phrase frequency, "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is comprised entirely of stop words), but will quickly dominate any word-level visualization or list.53 Other words may need to be added to or removed from stop lists depending on the corpus: for example, if the phrase "page number" appeared on every page, that would be useless information for the researcher.
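
A sketch of this filtering step, building on the word list from the earlier example (the short stop list here is purely illustrative; real stop lists run to hundreds of words):

    stopwords = {"the", "and", "to", "of", "a", "in", "is", "for", "on"};
    (* keep only the words that are not on the stop list *)
    filtered = Select[words, ! MemberQ[stopwords, #] &];
    Take[Reverse[SortBy[Tally[filtered], Last]], 25]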

The first step, then, is to use Mathematica to format word frequency information into a Wordle.net-readable format (if we put all of the text into that website it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds.

A Word Cloud of All Words in the CDC Corpus

What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that the collection deals with Canada and Canadians, that it contains documents in French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, they leave a severe amount of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but they need to be used cautiously. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present": despite the ambiguous title, we can quickly learn what this website contains. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.

Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters, such as 'bi,' two syllables, or, pertinent in our case, two words. An example: "canada is." A trigram is three words, a quadgram is four words, a fivegram is five words, and so on. A sketch of the extraction appears below.
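
Deriving n-grams from a word list is straightforward; a sketch in Mathematica, reusing the word list from above (stop words left in, since they matter for phrases):

    (* sliding windows of three words each: the trigrams *)
    trigrams = Partition[words, 3, 1];
    (* tally and rank them, most frequent first *)
    Take[Reverse[SortBy[Tally[trigrams], Last]], 15]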

Word Cloud of the File "Past and Present"

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

"alberta" 758; "listen" 490; "settlers" 467; "new" 454; "canada" 388; "land" 362; "read" 328; "people" 295; "settlement" 276; "chinese" 268; "canadian" 264; "place" 255; "heritage" 254; "edmonton" 250; "years" 238; "west" 236; "raymond" 218; "ukrainian" 216; "came" 213; "calgary" 209; "names" 206; "ranch" 204; "updated" 193; "history" 190; "communities" 183 …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example in this automatically generated output of top phrases:

"first people settlers" 100; "the united states" 72; "one of the" 59; "of the west" 58; "settlers last updated" 56; "people settlers last" 56; "albertans first people" 56; "adventurous albertans first" 56; "the bar u" 55; "opening of the" 54; "communities adventurous albertans" 54; "new communities adventurous" 54; "bar u ranch" 53; "listen to the" 52 …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457), or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stop-word" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

"experience in the" 919; "transferred to the" 918; "funded by the" 918; "of the provincial" 915; "this web site" 915; "409 registry 2" 914; "are no longer" 914; "gouvernement du canada" 910; "use this link" 887; "page or you" 887; "following page or" 887; "automatically transferred to" 887; "be automatically transferred" 887; "will be automatically" 887; "industry you will" 887; "multimedia industry you" 887; "practical experience in" 887; "gained practical experience" 887; "they gained practical" 887; "while they gained" 887; "canadians while they" 887; "young canadians while" 887; "showcase of work" 887; "excellent showcase of" 887 …

Here we see an array of common language: funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we move into the unique or very rare phrases that appear only once, twice, or three times.

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable, and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.
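
For readers working outside SEASR, the MALLET command-line workflow is a two-step process; a sketch, with the paths and the topic count as illustrative placeholders:

    # step one: import a directory of plaintext files into MALLET's format
    bin/mallet import-dir --input file267/ --output cdc.mallet \
        --keep-sequence --remove-stopwords
    # step two: train a topic model and write out the topic keys
    bin/mallet train-topics --input cdc.mallet --num-topics 20 \
        --output-topic-keys topic-keys.txt --output-doc-topics doc-topics.txt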

At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works. It is a proprietary black box: while the original algorithm is available, the actual methods that go into determining the ranking of search terms are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.

A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. In the graph above, we see a visualization of our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
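
A sketch of the underlying tally: count a term across every plaintext file and chart the results (the directory name is a placeholder):

    files = FileNames["*.txt", "cdc-plaintext"];
    (* occurrences of "freedom" in each file, case-folded *)
    counts = Map[StringCount[ToLowerCase[Import[#, "Text"]], "freedom"] &, files];
    BarChart[counts, ChartLabels -> Placed[Map[FileBaseName, files], Tooltip]]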

We can then move down into the data. In the example above, file "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.

A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, and subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly locate critical sources.
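
The keyword-in-context routine itself is compact at its core; a sketch (the 40-character window and the file name are illustrative):

    kwic[text_String, term_String, window_: 40] :=
      Map[StringTake[text, {Max[1, First[#] - window],
           Min[StringLength[text], Last[#] + window]}] &,
        StringPosition[ToLowerCase[text], term]];
    (* every occurrence of "ontario," with its surrounding context *)
    kwic[Import["blackgold.txt", "Text"], "ontario"]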

Another model through which we can visualize documents is a viewer primarily focusing on term frequency and inverse document frequency, discussed below. To prepare this material, we combine our earlier data about word frequency within a site (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e. the word "experience" appears 1,766 times). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf times the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency": one can quickly see the major words that make this a unique document. This technique can be used on any textual document.
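
A sketch of the weighting as a one-line function (the numbers in the example are illustrative only):

    (* tf-idf: term frequency times log of (corpus size / document frequency) *)
    tfidf[tf_, df_, ndocs_] := tf*Log[ndocs/df];
    (* e.g. a term appearing 919 times in one site, and in 45 of 384 sites overall *)
    N[tfidf[919, 45, 384]]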

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suites of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance, rather than processed on the spot, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need rears itself, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin, or jumpstart, a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. They should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and of attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software and databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web, whether through blogging, micro-blogging, publishing webpages, and so forth: it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history, and the readiness to work with born-digital sources, cannot stem from pedagogy alone, however. Institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for instance, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article has argued, first, why digital methodologies are necessary, and second, how we can realize aspects of this call. The Internet is 20 years old; before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1. The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996, with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2. A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3. See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline/44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere albeit in either a moreabridged form or with a focus on digitized historical sources A notableexception is Daniel J Cohen and Roy Rosenzweig Digital History AGuide to Gathering Preserving and Presenting the Past on the Web(Philadelphia University of Pennsylvania Press 2005) specifically theirchapter on ldquoCollecting History onlinerdquo Their focus is on the preservation-ist and archival challenges however Important notes are also found inDan Cohen ldquoHistory and the Second Decade of the Webrdquo RethinkingHistory 8 no 2 (June 2004) 293ndash301 also available online atwwwchnmgmueduessays-on-history-new-mediaessaysessayid=34

5 Figure taken from Jose Martinez ldquoTwitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Dayrdquo ComplexTech wwwcomplexcomtech201210twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day ltviewed 6 January 2013gtFurther more while a tweet itself is only 140 characters every tweetcomes with over 40 lines of metadata date information author biogra-phy and location information when the account was created the usersrsquofollowing and follower numbers country data and so forth This is aremarkable treasure trove of information for future historians but also apotential source of information overload See Sarah Perez ldquoThis is What

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

a Tweet Looks Likerdquo ReadWriteWeb (19 April 2010) wwwreadwritewebcomarchivesthis_is_what_a_tweet_looks_likephpltviewed 6 January 2013gt

6 This is further discussed in Ian Foster ldquoHow Computation ChangesResearchrdquo in Switching Codes Thinking Through Digital Technology inthe Humanities and the Arts eds Thomas Bartscherer and RoderickCoover (Chicago University of Chicago Press 2011) esp 22ndash4

7 Franco Moretti Graphs Maps Trees Abstract Models for Literary History(New York Verso 2005) 3ndash4

8 While not of prime focus here the longue dureacutee deserves definition Itstands opposite the history of events of instances covering instead avery long time span in an interdisciplinary social science framework In along essay Braudel noted that one can then see both structural crises ofeconomies and structures that constrain human society and develop-ment For more see Fernand Braudel ldquoHistory and the Social SciencesThe Longue Dureacuteerdquo in Fernand Braudel On History trans SarahMatthews (Chicago University of Chicago Press 1980) 27 The essaywas originally printed in the Annales ESC no 4 (OctoberndashDecember1958)

9 A great overview can be found at ldquoText Analysisrdquo Tooling Up for DigitalHumanities 2011 wwwtoolingupstanfordedupage_id=981 ltviewed19 December 2012gt

10 Jean-Baptiste Michel Yuan Kui Shen Aviva Presser Aiden Adrian VeresMatthew K Gray The Google Books Team Joseph P Pickett DaleHoiberg Dan Clancy Peter Norvig Jon Orwant Steven Pinker MartinA Nowak Erez Lieberman Aiden ldquoQuantitative Analysis of CultureUsing Millions of Digitized Booksrdquo Science 331 (14 January 2011)176ndash82 The tool is publicly accessible at ldquoGoogle Books NgramViewerrdquo Google Books 2010 wwwbooksgooglecomngrams ltviewed11 January 2013gt

11 Anthony Grafton ldquoLoneliness and Freedomrdquo Perspectives on History[online edition] (March 2011)wwwhistoriansorgperspectivesissues201111031103pre1cfmltviewed 11 January 2013gt

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden onAnthony Grafton ldquoLoneliness and Freedomrdquo Perspectives on History[online edition] (March 2011)wwwhistoriansorgperspectivesissues201111031103pre1cfmltviewed 11 January 2013gt

13 For more information see the Canadian Century Research InfrastructureProject website at wwwccriuottawacaCCRIHomehtml and the Programme de recherche en deacutemographie historique at wwwgeneologyumontrealcafrleprdhhtm

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

14 ldquoWith Criminal Intentrdquo CriminalIntentorg wwwcriminalintentorgltviewed 20 December 2012gt

15 For an overview see ldquoAward Recipients mdash 2011rdquo Digging into DataWebpage wwwdiggingintodataorgHomeAwardRecipients2011tabid185Defaultaspx ltviewed 20 December 2012gt

16 One can do no better than the completely free Programming Historianwhich uses entirely free and open-source programming tools to learn thebasics of computational history See The Programming Historian 2 2012-Present wwwprogramminghistorianorg ltviewed 11 January 2013gt

17 Michael Katz The People of Hamilton Canada West Family and Class ina Mid-Nineteenth-Century City (Cambridge MA Harvard UniversityPress 1975) Also see A Gordon Darroch and Michael D OrnsteinldquoEthnicity and Occupational Structure in Canada in 1871 The VerticalMosaic in Historical Perspectiverdquo Canadian Historical Review 61 no 3(1980) 305ndash33

18 Ian Anderson ldquoHistory and Computingrdquo Making History The ChangingFace of the Profession in Britain 2008 wwwhistoryacukmakinghistoryresourcesarticleshistory_and_computinghtml ltviewed 20 December2012gt

19 Christopher Collins ldquoText and Document Visualizationrdquo TheHumanities and Technology Camp Greater Toronto Area VisualizationBootcamp University of Ontario Institute of Technology Oshawa Ont21 October 2011

20 The quotations by Luther and Poe are commonly used to make thispoint For the best example see Clay Shirky Cognitive SurplusCreativity and Generosity in a Connected Age Google eBook Penguin2010 For Mumford see Lewis Mumford The Myth of the Machine vol2 The Pentagon of Power (New York Harcourt Brace 1970) 182 asquoted in Gleick The Information A History A Theory A Flood (NewYork Pantheon 2011) 404

21 Adam Ostrow ldquoCisco CRS-3 Download Library of Congress in JustOver One Secondrdquo Mashable (9 March 2010) accessed onlinewwwmashablecom20100309cisco-crs-3

22 A note on data size is necessary The basic unit of information is a ldquobyterdquo (8 bits) which is roughly a single character A kilobyte (KB) is athousand bytes a megabyte (MB) a million a gigabyte (GB) a billion aterabyte (TB) a trillion and a petabyte (PB) a quadrillion Beyond thisthere are exabytes (EB) zettabytes (ZB) and yottabytes (YB)

23 These figures can of course vary depending on compression methodschosen Their math was based on an average of 8MB per book if it wasscanned at 600dpi and then compressed There are 26 million books inthe Library of Congress For more see Mike Ashenfelder ldquoTransferringlsquoLibraries of Congressrsquo of Datardquo Library of Congress Digital Preservation

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

Blog 11 July 2011wwwblogslocgovdigitalpreservation201107transferring-libraries-of-congress-of-data ltviewed 11 January 2013gt

24 Gleick 232 25 ldquoAbout the Library of Congress Web Archivesrdquo Library of Congress Web

Archiving FAQs Undated wwwlocgovwebarchivingfaqhtmlfaqs_02ltviewed 11 January 2013gt

26 ldquo10000000000000000 Bytes Archivedrdquo Internet Archive Blog 26October 2012 wwwblogarchiveorg2012102610000000000000000-bytes-archivedltviewed 19 December 2012gt

27 Jean-Baptiste Michel et al 176ndash82 Published Online Ahead of Print16 December 2010 wwwsciencemagorgcontent3316014176ltviewed 29 July 2011gt See also Scott Cleland ldquoGooglersquos PiracyLiabilityrdquo Forbes 9 November 2011 wwwforbescomsitesscottcle-land20111109googles-piracy-liability ltviewed 11 January 2013gt

28 ldquoGovernment of Canada Web Archiverdquo Library and Archives Canada 17October 2007 wwwcollectionscanadagccawebarchivesindex-ehtmlltviewed 11 January 2013gt

29 For background see Bill Curry ldquoVisiting Library and Archives inOttawa Not Without an Appointmentrdquo Globe and Mail 1 May 2012wwwtheglobeandmailcomnewspoliticsottawa-notebookvisiting-library-and-archives-in-ottawa-not-without-an-appointmentarticle2418960 ltviewed 22 May 2012gt and the announcement ldquoLAC Begins Implementation of New Approach to service Deliveryrdquo Library and Archives Canada February 2012wwwcollectionscanadagccawhats-new013-560-ehtml ltviewed 22May 2012gt For professional responses see the ldquoCHArsquos comment onNew Directions for Library and Archives Canadardquo CHA Website March2012 wwwcha-shccaenNews_39items19html ltviewed 19December 2012gt and ldquoSave Library amp Archives Canadardquo CanadianAssociation of University Teachers launched AprilMay 2012 wwwsavelibraryarchivesca ltviewed 19 December 2012gt

30 Ian Milligan ldquoThe Smokescreen of lsquoModernizationrsquo at Library andArchives Canadardquo ActiveHistoryca 22 May 2012wwwactivehistoryca201205the-smokescreen-of-modernization-at-library-and-archives-canada ltviewed 19 December 2012gt

31 ldquoSave LAC May 2012 Campaign Updaterdquo SaveLibraryArchivesca May2012 wwwsavelibraryarchivescaupdate-2012-05aspx ltviewed 19December 2012gt

32 The challenges and opportunities presented by such projects is extensively discussed in Daniel J Cohen ldquoThe Future of Preserving the Pastrdquo CRM The Journal of Heritage Stewardship 2 no 2

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

(Summer 2005) 6ndash19 also available freely online at the RoyRosenzweig Center for History and New Mediawwwchnmgmueduessays-on-history-new-mediaessaysessayid=39For more see ldquoSeptember 11 Digital Archive Saving the Histories of September 11 2001rdquo Centre for History and New Media andAmerican Social History Project Center for Media and Learningwww911digitalarchiveorg ltviewed 11 January 2013gt ldquoHurricaneDigital Memory Bank Collecting and Preserving the Stories of Katrina and Ritardquo Center for History and New Media wwwhurricanearchiveorg ltviewed 11 January 2013gt and ldquoOccupyArchive Archiving the Occupy Movements from 2011rdquo Centre forHistory and New Media wwwoccupyarchiveorg ltviewed 11 January2013gt

33 Daniel Cohen and Roy Rosenzweig Digital History A Guide to Gathering Preserving and Presenting the Past on the Web(Philadelphia Temple University Press 2005) chapter on ldquoPreservingDigital Historyrdquo The chapter is available fully online atwwwchnmgmuedudigitalhistorypreserving5php ltviewed 19December 2012gt

34 For an overview see ldquoPreserving Our Digital Heritage The NationalDigital Information Infrastructure and Preservation Program 2010 Report a Collaborative Initiative of the Library of CongressrdquoDigitalpreservationgov wwwdigitalpreservationgovmultimediadocumentsNDIIPP2010Report_Postpdf ltviewed 19 December 2012gt

35 Steve Meloan ldquoNo Way to Run a Culturerdquo Wired 13 February 1998available online via the Internet Archivewwwwebarchiveorgweb20000619001705httpwwwwiredcomnewsculture012841030100html ltviewed 19 December 2012gt

36 Some of this has been discussed in greater depth on my website See theseries beginning with Ian Milligan ldquoWARC Files A Challenge forHistorians and Finding Needles in Haystacksrdquo ianmilliganca 12December 2012 wwwianmilliganca20121212warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks ltviewed 20 December2012gt

37 Unstructured data as opposed to structured data is important for histo-rians but brings additional challenges Some projects have producedstructured data novels marked up using XML encoding for example soone can immediately see place names dates character names and soforth This makes it much easier for a computer to read and understand

38 See William Noblett ldquoDigitization A Cautionary Talerdquo New Review ofAcademic Librarianship 17 (2011) 3 Tobias Blanke Michael Bryantand Mark Hedges ldquoOcropodium Open Source OCR for Small-ScaleHistorical Archivesrdquo Journal of Information Science 38 no 1 (2012) 83

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

and Donald S Macqueen ldquoDeveloping Methods for Very-Large-ScaleSearches in Proquest Historical Newspapers Collections and InfotracThe Times Digital Archive The Case of Two Million versus TwoMillionsrdquo Journal of English Linguistics 32 no 2 (June 2004) 127

39 Elizabeth Krug ldquoCanadarsquos Digital Collections Sharing the CanadianIdentity on the Internetrdquo The Archivist [Library and Archives Canadapublication] April 2000 wwwcollectionscanadagccapublicationsarchivistmagazine015002-2170-ehtml ltviewed 11 January 2013gt

40 ldquoWayBackMachine Beta wwwcollectionsicgccardquo Internet ArchiveAvailable onlinewwwwaybackarchiveorgwebhttpcollectionsicgcca ltviewed 11January 2013gt

41 ldquoStreet Youth Go Onlinerdquo News Release from Industry Canada InternetArchive 15 November 1996 Available onlinewwwwebarchiveorgweb20001011131150httpinfoicgccacmbWelcomeicnsfffc979db07de58e6852564e400603639514a8d746a34803385256612004d90b3OpenDocument ltviewed 11 January 2013gt

42 Darlene Fichter ldquoProjects for Libraries and Archivesrdquo Canadarsquos DigitalCollection Webpage 24 November 2000 through Internet Archivewwwwebarchiveorgweb200012091312httpcollectionsicgccaEcdc_projhtm ltviewed 18 June 2012gt

43 ldquoCanadarsquos Digital Collections homepagerdquo 21 November 2003 through Internet Archive wwwwebarchiveorgweb20031121213939httpcollectionsicgccaEindexphpvo=15 ltviewed 18 June 2012gt

44 ldquoImportant news about Canadarsquos Digital Collectionsrdquo Canadarsquos DigitalCollections Webpage 18 November 2006gt through Internet Archivewwwwebarchiveorgweb20061118125255httpcollectionsicgccaEviewhtml ltviewed 18 June 2012gt

45 For more information see the information at ldquoWolfram Mathematica9rdquo Wolfram Research wwwwolframcommathematica ltviewed 11January 2013gt An example of work in Mathematica can be seen inWilliam J Turkelrsquos demonstration of ldquoTerm Weighting with TF-IDFrdquodrawing on Old Bailey proceedings It is available at wwwdemonstrationswolframcomTermWeightingWithTFIDFltviewed 11 January 2013gt

46 It was 8 degrees Celsius at 500 pm on 18 December 1983 ldquoEdithrdquowas more popular than Emma until around 1910 and now ldquoEmmardquo isthe third most popular name in the United States for new births in2010 the Canadian unemployment rate in March 1985 was 106 per-cent 25th highest in the world and the Canadian GDP in December1995 was $6039 billion per year 10th largest in the world

47 It was formerly known as ldquoVoyeur Toolsrdquo before a 2011 name change Itis available online at wwwvoyant-toolsorg

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

48 For more information see ldquoSoftware Environment for the Advancementof Scholarly Research (SEASR)rdquo wwwseasrorg ltviewed 18 June2012gt

49 The MALLET toolkit can also be run independently of SEASR For more on topic modeling specifically see ldquoTopic Modelingrdquo MALLET Machine Learning for Language Toolkitwwwmalletcsumassedutopicsphp ltviewed 18 June 2012gt

50 For a good overview see ldquoAnalytics Dunningrsquos Log-Likelihood StatisticrdquoMONK Metadata Offer New Knowledge wwwgautamlisillinoisedumonkmiddlewarepublicanalyticsdunningshtml ltviewed 18 June2012gt

51 ldquoImportant Noticesrdquo Library and Archives Website last updated 30 July2010 wwwcollectionscanadagccanoticesindex-ehtml ltviewed 18June 2012gt Historians will have to increasingly learn to navigate theTerms of Services (TOS) of born-digital source repositories They can beusually found on index or splash pages under headings as varied asldquoTerms of Servicerdquo ldquoPermissionsrdquo ldquoImportant Noticesrdquo or others

52 The command was wget mdashm mdash-cookies=on mdash-keep-session-cookies mdash-load-cookies=cookietxt mdash-referrer=httpepelac-bacgcca100205301iccdccensusdefaulthtm mdashrandom-wait mdashwait=1 mdashlimit-rate=30K httpepelac-bacgcca100205301iccdcEAlphabetasp This loadeda previous cookie load as there is a ldquosplash redirectrdquo page warning aboutdysfunctional archived content It also imposed a random wait andbandwidth cap to be a good netizen

53 For a discussion of this see Christopher D Manning PrabhakarRaghavan Hinrich Scuumltze An Introduction to Information Retrieval(Cambridge UK Cambridge University Press 2009) 27ndash8 Availableonline at wwwnlpstanfordeduIR-book

54 David M Blei ldquoIntroduction to Probabilistic Topic Modelsrdquowwwcsprincetonedu~bleipapersBlei2011pdf ltviewed 18 June2012gt

55 The research was actually carried out on a rather dated late 2007MacBook Pro

56 See John Bonnett ldquoHigh-Performance Computing An Agenda for theSocial Sciences and the Humanities in Canadardquo Digital Studies 1 no 2(2009) wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt and John Bonnett Geoffrey Rockwell and KyleKuchmey ldquoHigh Performance Computing in the arts and HumanitiesrdquoSHARCNET Shared Hierarchical Academic Research Computing Network2006 wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt

57 ldquoSHARCNET History of SHARCNETrdquo SHARCNET 2009

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

wwwsharcnetcamyabouthistory ltviewed 18 December 2012gt 58 ldquoPriority Areasrdquo SSHRC last modified 4 November 2011

wwwsshrc-crshgccafunding-financementprograms-programmespriority_areas-domaines_prioritairesindex-engaspx ltviewed 18 June2012gt

59 Eloquently discussed in Adam Crymble ldquoIs Digital Humanities a Fieldof Researchrdquo Thoughts on Digital amp Public History 11 August 2011wwwadamcrymbleblogspotcom201108is-digital-humanities-field-of-researchhtml ltviewed 18 June 2012gt

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 18: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

"Canada's young people get the skills and experience they need to participate in the knowledge-based economy of the future," and that CDC "shows that young people facing significant employment challenges can contribute to and learn from the resources available on the Information Highway."41 The documentation, reflecting the lower level of technical expertise and the scarcity of computer hardware at the time, was comprehensive: participants were encouraged to identify a collection, digitize it, process it (by cleaning up the data), preserve it on a hard drive, and display it on the web. Issues of copyright were discussed, as well as preservation issues, hardware requirements (a computer with an Intel 386 processor, 8MB of RAM, a 1GB hard drive, and a 28.8 kbps modem was recommended), and training information for employers, educators, and participants.42 Sample proposals, community liaison ideas, amidst other information, set out a very comprehensive roadmap for a grassroots, bottom-up approach to digital preservation and heritage.

Moving through the preserved websites offers an interesting snapshot of how web design standards changed. Colour disappears, replaced by a more subdued, coordinated federal website standard (still largely in effect today). Surveys greet viewers from 2003 onwards, trying to learn how viewers discovered the site and whether they think it is a good educational resource (the administrators must have been relieved that the majority thought so: in 2003, some 84 percent felt that it was a "good" resource, the best of its type).43 By 2004 the program was engaging in active remembrance of its past programs as it neared its ten-year anniversary. Proposals stopped being accepted in July 2004, but the 600 webpages were still accessible, with a comprehensive subject index, alphabetical listing, success stories, and detailed copyright information.

The Canada's Digital Collections site began to move towards removal in November 2006, and was initially completely gone. A visitor to the site would learn that the content "is no longer available," and an e-mail address was provided for contributors to get their information back.44 The information was completely inaccessible, and by early 2007 a bold notice announced that the now-empty site would close as of 2 April 2007. Luckily, by April 2007 a notice appeared with a declaration that the collections were now archived by Library and Archives Canada, just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained then, and today for that matter, is a straightforward alphabetical list and a disclaimer that noted "[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost."

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use to navigate large amounts of data. First, I create my own tools through computer programming; this work is showcased in the following section. I program in Wolfram Research's Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base that complements exhaustive, well-written documentation from Wolfram Research. To emphasize its ease: one line of code can import a website in perfectly readable plaintext (avoiding unicode errors), as follows: input = Import["http://activehistory.ca/", "Plaintext"], and a second can lower-case it all for textual analysis: lowerpage = ToLowerCase[input]. Variables are a critical building block: in the former example, "input" now holds the entire plaintext of the website ActiveHistory.ca, and "lowerpage" then holds it in lowercase. Once we have lower-cased information, we can quickly extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
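Extending that example into a first frequency count takes only a few more lines. The following is a minimal sketch of such a pass; the variable names and the regular-expression split are my own choices rather than the article's actual scripts:

  input = Import["http://activehistory.ca/", "Plaintext"];
  lowerpage = ToLowerCase[input];
  words = StringSplit[lowerpage, RegularExpression["[^a-z]+"]];  (* split on anything that is not a letter *)
  wordcounts = SortBy[Tally[words], -Last[#] &];  (* tally and sort, most frequent first *)
  Take[wordcounts, 25]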

Even more promising is Mathematica's integration with the Wolfram|Alpha database (itself accessible at wolframalpha.com), a computational knowledge engine that promises to put incredible amounts of information at a user's fingertips. With a quick command, one can access the temperature at Toronto's Lester B. Pearson International Airport at 5:00 pm on 18 December 1983, the relative popularity of names such as "Edith" or "Emma," the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information or find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.
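From within a notebook, such lookups are one-liners through the built-in WolframAlpha[] function; the query phrasings below are my own guesses at wording rather than tested strings, and a live internet connection is required:

  WolframAlpha["Canadian unemployment rate March 1985"]
  WolframAlpha["temperature Toronto Pearson Airport 5pm 18 December 1983"]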

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be-all and end-all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools.47 It is a powerful suite of textual analysis tools. Uploading one's own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a "word cloud" of a corpus; word trends, where a click on any word shows its relative rise and fall over time; and access to the original text, where a click on any word shows its contextual placement within individual sentences. It ably facilitates the "distance" reading of any reasonably-sized body of textual information and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. There are many visualization options provided: you can plot relationships between data, compare values (i.e. bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or, in a best case scenario, hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" which appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.
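The Dunning statistic itself is compact enough to compute directly. What follows is a sketch of the underlying log-likelihood calculation, not SEASR's own implementation: given a word appearing a times in a study corpus of n1 words and b times in a reference corpus of n2 words,

  dunningG2[a_, n1_, b_, n2_] := Module[{e1 = n1 (a + b)/(n1 + n2), e2 = n2 (a + b)/(n1 + n2)},
    2 (If[a > 0, a Log[a/e1], 0] + If[b > 0, b Log[b/e2], 0])]  (* expected counts under the null, then G-squared *)
  dunningG2[150, 50000, 300, 1000000]  (* hypothetical counts: a word overrepresented in 1968 lyrics scores high *)

Large values flag the words whose frequency in the study corpus diverges most sharply from the reference corpus.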

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2 or on my own blog at ianmilligan.ca. I have elected to leave them online rather than provide commands in this article: they require some technical knowledge, and working directly from the web is quicker than working from here. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian, dealing with conventional sources, often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command line entry brought the entire archive onto my system; it took days to process the information, and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay, or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations: the information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, with 360 MB of plaintext with HTML encoding, reduced to a 'mere' 50MB if all encoding is removed. A 50MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with the material. Before you do this, though, you need to be familiar with the website's "terms of service," the legal documentation that outlines how users are allowed to use a website, its resources, or its software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced in part or in whole and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so: the index is located at epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in …/ic/cdc/ABbenevolat, the next in …/cdc/abnature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system, in this case one noting that you had seen the disclaimer about the archival nature of the website) and then, with a one-line command at the command line, to download the entire database.52 Due to the imposition of a random wait and bandwidth limit, which meant that I would download one file every half to one and a half seconds, at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these must often be inferred; online, HTML born-digital sources are different in that the link offers a very concrete file.

Visualizing the CDC Collection as a Series of Hyperlinks

One quick visualization, the circular cluster seen above, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As we see in the figure, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The crawl is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and the relationships between different sources. At a glimpse, we can see suggestions to follow other sites and determine if any nodes are shared amongst the various websites revealed.
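Mathematica can both scrape and draw such a network. A rough sketch of the approach follows; Import's "Hyperlinks" element pulls every link from a page, but the seed list, the single-level crawl, and the labeling options here are my own assumptions rather than the article's actual code:

  pages = {"http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp"};  (* seed page(s) of the collection *)
  edges = Flatten[Table[Thread[p -> Import[p, "Hyperlinks"]], {p, pages}]];  (* one directed edge per outgoing link *)
  Graph[edges, VertexLabels -> Placed["Name", Tooltip]]  (* hover a node to identify the URL *)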

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian primarily interested in the written word, my first step was to isolate the plaintext throughout the material, so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels: first, the aggregate collection, and second, dividing the collection into each 'file,' or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words within the nearly eight million words of the collection appear? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader and is of minimal visual attractiveness. Two quick decisions were made to enhance data usability: first, the information was "case-folded," all sent to lower-case; this enables us to count the total appearances of a word such as "digital" regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency, as opposed to phrase frequency, "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is comprised entirely of stop words), but will quickly dominate any word-level visualization or list.53 Other words may need to be added or removed from stop lists depending on the corpus: for example, if the phrase "page number" appeared on every page, that would be useless information for the researcher.
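Applying a stop list to the frequency counts from the earlier sketch is a one-line filter; the eight-word stop list below is purely illustrative (real lists run to hundreds of words):

  stopwords = {"the", "and", "to", "of", "a", "in", "is", "it"};
  filtered = Select[wordcounts, ! MemberQ[stopwords, First[#]] &];  (* drop any pair whose word is on the stop list *)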

The first step, then, is to use Mathematica to format word frequency information into Wordle.net readable format (if we put all of the text into that website, it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds.

A Word Cloud of All Words in the CDC Corpus

What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that it is dealing with Canada/Canadians, that it is a collection with documents in French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, they leave a severe amount of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but they need to be used with extreme caution. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present." Despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters (such as 'bi'), two syllables, or, pertinent in our case, two words. An example: "canada is." A trigram is three words, a quadgram is four words, a fivegram is five words, and so on.
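Once a text is reduced to a word list, n-grams fall out of a single Partition command, which slides a window of length n across the list. A sketch using the words list from the earlier example:

  trigrams = Partition[words, 3, 1];  (* offset 1 gives every overlapping three-word window *)
  Take[SortBy[Tally[trigrams], -Last[#] &], 15]  (* the fifteen most frequent trigrams *)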


Word Cloud of the File "Past and Present"

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

"alberta" 758, "listen" 490, "settlers" 467, "new" 454, "canada" 388, "land" 362, "read" 328, "people" 295, "settlement" 276, "chinese" 268, "canadian" 264, "place" 255, "heritage" 254, "edmonton" 250, "years" 238, "west" 236, "raymond" 218, "ukrainian" 216, "came" 213, "calgary" 209, "names" 206, "ranch" 204, "updated" 193, "history" 190, "communities" 183 …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example in this automatically generated output of top phrases:

"first" "people" "settlers" 100; "the" "united" "states" 72; "one" "of" "the" 59; "of" "the" "west" 58; "settlers" "last" "updated" 56; "people" "settlers" "last" 56; "albertans" "first" "people" 56; "adventurous" "albertans" "first" 56; "the" "bar" "u" 55; "opening" "of" "the" 54; "communities" "adventurous" "albertans" 54; "new" "communities" "adventurous" 54; "bar" "u" "ranch" 53; "listen" "to" "the" 52 …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to quickly hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stop-word" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

"experience" "in" "the" 919; "transferred" "to" "the" 918; "funded" "by" "the" 918; "of" "the" "provincial" 915; "this" "web" "site" 915; "409" "registry" "2" 914; "are" "no" "longer" 914; "gouvernement" "du" "canada" 910; "use" "this" "link" 887; "page" "or" "you" 887; "following" "page" "or" 887; "automatically" "transferred" "to" 887; "be" "automatically" "transferred" 887; "will" "be" "automatically" 887; "industry" "you" "will" 887; "multimedia" "industry" "you" 887; "practical" "experience" "in" 887; "gained" "practical" "experience" 887; "they" "gained" "practical" 887; "while" "they" "gained" 887; "canadians" "while" "they" 887; "young" "canadians" "while" 887; "showcase" "of" "work" 887; "excellent" "showcase" "of" 887 …

Here we see an array of common language: funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we can move into the unique or very rare phrases that appear only once, twice, or three times.
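Both counting schemes described above, aggregate appearances and the share of sites containing a term, are short functions once each site's text is held as a list of word lists; siteWords is my assumed name for that structure:

  aggregateCount[term_] := Total[Count[#, term] & /@ siteWords]  (* total appearances across all sites *)
  relativeCount[term_] := Count[siteWords, w_ /; MemberQ[w, term]]/Length[siteWords]  (* share of sites containing the term *)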

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time. … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta towns, bridges, lakes); the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language); the stories (story, knight, archives, but also, interestingly, knight and wife); and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works: it is a proprietary black box. While the original algorithm is available, the actual methods that go into determining the ranking of search terms are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for the occurrence of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
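A sketch of that search-and-chart step follows; here freqData is an assumed list of {site, term, count} entries compiled from the per-site frequency files, rather than the article's own data structure:

  hits = Cases[freqData, {_, term_, _} /; MemberQ[{"freedom", "black", "liberty"}, term]];  (* keep matching triples *)
  BarChart[Tooltip[Last[#], First[#]] & /@ hits]  (* hover a bar to see which site it represents *)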

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.
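A keyword-in-context routine can be improvised in a line or two. A minimal sketch, assuming the site's plaintext sits in a variable named text and the search term contains no regular-expression metacharacters:

  kwic[text_String, term_String] := StringCases[ToLowerCase[text], RegularExpression[".{0,40}" <> term <> ".{0,40}"]]
  kwic[text, "ontario"] // TableForm  (* forty characters of context on either side of each hit *)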


A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are located. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly realize critical sources.

Another model through which we can visualize documents is a viewer primarily focusing on term frequency and inverse document frequency, terms discussed below. To prepare this material, we combine our earlier data about word frequency (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e., the word "experience" appears 1,766 times). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf times the logarithm of the number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.
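The weighting reduces to one line of code, matching the formula as stated; nDocs (the total number of sites) and df (the number of sites containing the term) are my own variable names, and the counts in the example are hypothetical:

  tfidf[tf_, nDocs_, df_] := tf Log[nDocs/df]
  tfidf[919, 384, 45]  (* a term used 919 times in one site, found in 45 of 384 sites overall *)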

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule, but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 There are limitations, however: jobs must be submitted into queues in advance rather than processed on the spot, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need rears itself, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin, or jumpstart, a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. Such courses should not be mandatory, but instead be taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software and databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web, whether through blogging, micro-blogging, or publishing webpages; it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history, and the readiness to work with born-digital sources, cannot stem out of pedagogy alone, however; institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples from the CDC case study, for instance, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article has argued, first, why digital methodologies are necessary, and second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1. The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2. A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See ldquoIdleNoMore Timelinerdquo ongoingwwwidlenomoremakookcatimeline44 ltviewed 18 January 2013gt

4 These issues have been discussed elsewhere albeit in either a moreabridged form or with a focus on digitized historical sources A notableexception is Daniel J Cohen and Roy Rosenzweig Digital History AGuide to Gathering Preserving and Presenting the Past on the Web(Philadelphia University of Pennsylvania Press 2005) specifically theirchapter on ldquoCollecting History onlinerdquo Their focus is on the preservation-ist and archival challenges however Important notes are also found inDan Cohen ldquoHistory and the Second Decade of the Webrdquo RethinkingHistory 8 no 2 (June 2004) 293ndash301 also available online atwwwchnmgmueduessays-on-history-new-mediaessaysessayid=34

5 Figure taken from Jose Martinez ldquoTwitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Dayrdquo ComplexTech wwwcomplexcomtech201210twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day ltviewed 6 January 2013gtFurther more while a tweet itself is only 140 characters every tweetcomes with over 40 lines of metadata date information author biogra-phy and location information when the account was created the usersrsquofollowing and follower numbers country data and so forth This is aremarkable treasure trove of information for future historians but also apotential source of information overload See Sarah Perez ldquoThis is What

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

a Tweet Looks Likerdquo ReadWriteWeb (19 April 2010) wwwreadwritewebcomarchivesthis_is_what_a_tweet_looks_likephpltviewed 6 January 2013gt

6 This is further discussed in Ian Foster ldquoHow Computation ChangesResearchrdquo in Switching Codes Thinking Through Digital Technology inthe Humanities and the Arts eds Thomas Bartscherer and RoderickCoover (Chicago University of Chicago Press 2011) esp 22ndash4

7 Franco Moretti Graphs Maps Trees Abstract Models for Literary History(New York Verso 2005) 3ndash4

8 While not of prime focus here the longue dureacutee deserves definition Itstands opposite the history of events of instances covering instead avery long time span in an interdisciplinary social science framework In along essay Braudel noted that one can then see both structural crises ofeconomies and structures that constrain human society and develop-ment For more see Fernand Braudel ldquoHistory and the Social SciencesThe Longue Dureacuteerdquo in Fernand Braudel On History trans SarahMatthews (Chicago University of Chicago Press 1980) 27 The essaywas originally printed in the Annales ESC no 4 (OctoberndashDecember1958)

9 A great overview can be found at ldquoText Analysisrdquo Tooling Up for DigitalHumanities 2011 wwwtoolingupstanfordedupage_id=981 ltviewed19 December 2012gt

10 Jean-Baptiste Michel Yuan Kui Shen Aviva Presser Aiden Adrian VeresMatthew K Gray The Google Books Team Joseph P Pickett DaleHoiberg Dan Clancy Peter Norvig Jon Orwant Steven Pinker MartinA Nowak Erez Lieberman Aiden ldquoQuantitative Analysis of CultureUsing Millions of Digitized Booksrdquo Science 331 (14 January 2011)176ndash82 The tool is publicly accessible at ldquoGoogle Books NgramViewerrdquo Google Books 2010 wwwbooksgooglecomngrams ltviewed11 January 2013gt

11 Anthony Grafton ldquoLoneliness and Freedomrdquo Perspectives on History[online edition] (March 2011)wwwhistoriansorgperspectivesissues201111031103pre1cfmltviewed 11 January 2013gt

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden onAnthony Grafton ldquoLoneliness and Freedomrdquo Perspectives on History[online edition] (March 2011)wwwhistoriansorgperspectivesissues201111031103pre1cfmltviewed 11 January 2013gt

13 For more information see the Canadian Century Research InfrastructureProject website at wwwccriuottawacaCCRIHomehtml and the Programme de recherche en deacutemographie historique at wwwgeneologyumontrealcafrleprdhhtm

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

14 ldquoWith Criminal Intentrdquo CriminalIntentorg wwwcriminalintentorgltviewed 20 December 2012gt

15 For an overview see ldquoAward Recipients mdash 2011rdquo Digging into DataWebpage wwwdiggingintodataorgHomeAwardRecipients2011tabid185Defaultaspx ltviewed 20 December 2012gt

16 One can do no better than the completely free Programming Historianwhich uses entirely free and open-source programming tools to learn thebasics of computational history See The Programming Historian 2 2012-Present wwwprogramminghistorianorg ltviewed 11 January 2013gt

17 Michael Katz The People of Hamilton Canada West Family and Class ina Mid-Nineteenth-Century City (Cambridge MA Harvard UniversityPress 1975) Also see A Gordon Darroch and Michael D OrnsteinldquoEthnicity and Occupational Structure in Canada in 1871 The VerticalMosaic in Historical Perspectiverdquo Canadian Historical Review 61 no 3(1980) 305ndash33

18 Ian Anderson ldquoHistory and Computingrdquo Making History The ChangingFace of the Profession in Britain 2008 wwwhistoryacukmakinghistoryresourcesarticleshistory_and_computinghtml ltviewed 20 December2012gt

19 Christopher Collins ldquoText and Document Visualizationrdquo TheHumanities and Technology Camp Greater Toronto Area VisualizationBootcamp University of Ontario Institute of Technology Oshawa Ont21 October 2011

20 The quotations by Luther and Poe are commonly used to make thispoint For the best example see Clay Shirky Cognitive SurplusCreativity and Generosity in a Connected Age Google eBook Penguin2010 For Mumford see Lewis Mumford The Myth of the Machine vol2 The Pentagon of Power (New York Harcourt Brace 1970) 182 asquoted in Gleick The Information A History A Theory A Flood (NewYork Pantheon 2011) 404

21 Adam Ostrow ldquoCisco CRS-3 Download Library of Congress in JustOver One Secondrdquo Mashable (9 March 2010) accessed onlinewwwmashablecom20100309cisco-crs-3

22 A note on data size is necessary The basic unit of information is a ldquobyterdquo (8 bits) which is roughly a single character A kilobyte (KB) is athousand bytes a megabyte (MB) a million a gigabyte (GB) a billion aterabyte (TB) a trillion and a petabyte (PB) a quadrillion Beyond thisthere are exabytes (EB) zettabytes (ZB) and yottabytes (YB)

23 These figures can of course vary depending on compression methodschosen Their math was based on an average of 8MB per book if it wasscanned at 600dpi and then compressed There are 26 million books inthe Library of Congress For more see Mike Ashenfelder ldquoTransferringlsquoLibraries of Congressrsquo of Datardquo Library of Congress Digital Preservation

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

Blog 11 July 2011wwwblogslocgovdigitalpreservation201107transferring-libraries-of-congress-of-data ltviewed 11 January 2013gt

24 Gleick 232 25 ldquoAbout the Library of Congress Web Archivesrdquo Library of Congress Web

Archiving FAQs Undated wwwlocgovwebarchivingfaqhtmlfaqs_02ltviewed 11 January 2013gt

26 ldquo10000000000000000 Bytes Archivedrdquo Internet Archive Blog 26October 2012 wwwblogarchiveorg2012102610000000000000000-bytes-archivedltviewed 19 December 2012gt

27 Jean-Baptiste Michel et al 176ndash82 Published Online Ahead of Print16 December 2010 wwwsciencemagorgcontent3316014176ltviewed 29 July 2011gt See also Scott Cleland ldquoGooglersquos PiracyLiabilityrdquo Forbes 9 November 2011 wwwforbescomsitesscottcle-land20111109googles-piracy-liability ltviewed 11 January 2013gt

28 ldquoGovernment of Canada Web Archiverdquo Library and Archives Canada 17October 2007 wwwcollectionscanadagccawebarchivesindex-ehtmlltviewed 11 January 2013gt

29 For background see Bill Curry ldquoVisiting Library and Archives inOttawa Not Without an Appointmentrdquo Globe and Mail 1 May 2012wwwtheglobeandmailcomnewspoliticsottawa-notebookvisiting-library-and-archives-in-ottawa-not-without-an-appointmentarticle2418960 ltviewed 22 May 2012gt and the announcement ldquoLAC Begins Implementation of New Approach to service Deliveryrdquo Library and Archives Canada February 2012wwwcollectionscanadagccawhats-new013-560-ehtml ltviewed 22May 2012gt For professional responses see the ldquoCHArsquos comment onNew Directions for Library and Archives Canadardquo CHA Website March2012 wwwcha-shccaenNews_39items19html ltviewed 19December 2012gt and ldquoSave Library amp Archives Canadardquo CanadianAssociation of University Teachers launched AprilMay 2012 wwwsavelibraryarchivesca ltviewed 19 December 2012gt

30 Ian Milligan ldquoThe Smokescreen of lsquoModernizationrsquo at Library andArchives Canadardquo ActiveHistoryca 22 May 2012wwwactivehistoryca201205the-smokescreen-of-modernization-at-library-and-archives-canada ltviewed 19 December 2012gt

31 ldquoSave LAC May 2012 Campaign Updaterdquo SaveLibraryArchivesca May2012 wwwsavelibraryarchivescaupdate-2012-05aspx ltviewed 19December 2012gt

32 The challenges and opportunities presented by such projects is extensively discussed in Daniel J Cohen ldquoThe Future of Preserving the Pastrdquo CRM The Journal of Heritage Stewardship 2 no 2

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

(Summer 2005) 6ndash19 also available freely online at the RoyRosenzweig Center for History and New Mediawwwchnmgmueduessays-on-history-new-mediaessaysessayid=39For more see ldquoSeptember 11 Digital Archive Saving the Histories of September 11 2001rdquo Centre for History and New Media andAmerican Social History Project Center for Media and Learningwww911digitalarchiveorg ltviewed 11 January 2013gt ldquoHurricaneDigital Memory Bank Collecting and Preserving the Stories of Katrina and Ritardquo Center for History and New Media wwwhurricanearchiveorg ltviewed 11 January 2013gt and ldquoOccupyArchive Archiving the Occupy Movements from 2011rdquo Centre forHistory and New Media wwwoccupyarchiveorg ltviewed 11 January2013gt

33 Daniel Cohen and Roy Rosenzweig Digital History A Guide to Gathering Preserving and Presenting the Past on the Web(Philadelphia Temple University Press 2005) chapter on ldquoPreservingDigital Historyrdquo The chapter is available fully online atwwwchnmgmuedudigitalhistorypreserving5php ltviewed 19December 2012gt

34 For an overview see ldquoPreserving Our Digital Heritage The NationalDigital Information Infrastructure and Preservation Program 2010 Report a Collaborative Initiative of the Library of CongressrdquoDigitalpreservationgov wwwdigitalpreservationgovmultimediadocumentsNDIIPP2010Report_Postpdf ltviewed 19 December 2012gt

35 Steve Meloan ldquoNo Way to Run a Culturerdquo Wired 13 February 1998available online via the Internet Archivewwwwebarchiveorgweb20000619001705httpwwwwiredcomnewsculture012841030100html ltviewed 19 December 2012gt

36 Some of this has been discussed in greater depth on my website See theseries beginning with Ian Milligan ldquoWARC Files A Challenge forHistorians and Finding Needles in Haystacksrdquo ianmilliganca 12December 2012 wwwianmilliganca20121212warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks ltviewed 20 December2012gt

37 Unstructured data as opposed to structured data is important for histo-rians but brings additional challenges Some projects have producedstructured data novels marked up using XML encoding for example soone can immediately see place names dates character names and soforth This makes it much easier for a computer to read and understand

38 See William Noblett ldquoDigitization A Cautionary Talerdquo New Review ofAcademic Librarianship 17 (2011) 3 Tobias Blanke Michael Bryantand Mark Hedges ldquoOcropodium Open Source OCR for Small-ScaleHistorical Archivesrdquo Journal of Information Science 38 no 1 (2012) 83

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

and Donald S Macqueen ldquoDeveloping Methods for Very-Large-ScaleSearches in Proquest Historical Newspapers Collections and InfotracThe Times Digital Archive The Case of Two Million versus TwoMillionsrdquo Journal of English Linguistics 32 no 2 (June 2004) 127

39 Elizabeth Krug ldquoCanadarsquos Digital Collections Sharing the CanadianIdentity on the Internetrdquo The Archivist [Library and Archives Canadapublication] April 2000 wwwcollectionscanadagccapublicationsarchivistmagazine015002-2170-ehtml ltviewed 11 January 2013gt

40 ldquoWayBackMachine Beta wwwcollectionsicgccardquo Internet ArchiveAvailable onlinewwwwaybackarchiveorgwebhttpcollectionsicgcca ltviewed 11January 2013gt

41 ldquoStreet Youth Go Onlinerdquo News Release from Industry Canada InternetArchive 15 November 1996 Available onlinewwwwebarchiveorgweb20001011131150httpinfoicgccacmbWelcomeicnsfffc979db07de58e6852564e400603639514a8d746a34803385256612004d90b3OpenDocument ltviewed 11 January 2013gt

42 Darlene Fichter ldquoProjects for Libraries and Archivesrdquo Canadarsquos DigitalCollection Webpage 24 November 2000 through Internet Archivewwwwebarchiveorgweb200012091312httpcollectionsicgccaEcdc_projhtm ltviewed 18 June 2012gt

43 ldquoCanadarsquos Digital Collections homepagerdquo 21 November 2003 through Internet Archive wwwwebarchiveorgweb20031121213939httpcollectionsicgccaEindexphpvo=15 ltviewed 18 June 2012gt

44 ldquoImportant news about Canadarsquos Digital Collectionsrdquo Canadarsquos DigitalCollections Webpage 18 November 2006gt through Internet Archivewwwwebarchiveorgweb20061118125255httpcollectionsicgccaEviewhtml ltviewed 18 June 2012gt

45 For more information see the information at ldquoWolfram Mathematica9rdquo Wolfram Research wwwwolframcommathematica ltviewed 11January 2013gt An example of work in Mathematica can be seen inWilliam J Turkelrsquos demonstration of ldquoTerm Weighting with TF-IDFrdquodrawing on Old Bailey proceedings It is available at wwwdemonstrationswolframcomTermWeightingWithTFIDFltviewed 11 January 2013gt

46 It was 8 degrees Celsius at 500 pm on 18 December 1983 ldquoEdithrdquowas more popular than Emma until around 1910 and now ldquoEmmardquo isthe third most popular name in the United States for new births in2010 the Canadian unemployment rate in March 1985 was 106 per-cent 25th highest in the world and the Canadian GDP in December1995 was $6039 billion per year 10th largest in the world

47 It was formerly known as ldquoVoyeur Toolsrdquo before a 2011 name change Itis available online at wwwvoyant-toolsorg

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

48 For more information see ldquoSoftware Environment for the Advancementof Scholarly Research (SEASR)rdquo wwwseasrorg ltviewed 18 June2012gt

49 The MALLET toolkit can also be run independently of SEASR For more on topic modeling specifically see ldquoTopic Modelingrdquo MALLET Machine Learning for Language Toolkitwwwmalletcsumassedutopicsphp ltviewed 18 June 2012gt

50 For a good overview see ldquoAnalytics Dunningrsquos Log-Likelihood StatisticrdquoMONK Metadata Offer New Knowledge wwwgautamlisillinoisedumonkmiddlewarepublicanalyticsdunningshtml ltviewed 18 June2012gt

51 ldquoImportant Noticesrdquo Library and Archives Website last updated 30 July2010 wwwcollectionscanadagccanoticesindex-ehtml ltviewed 18June 2012gt Historians will have to increasingly learn to navigate theTerms of Services (TOS) of born-digital source repositories They can beusually found on index or splash pages under headings as varied asldquoTerms of Servicerdquo ldquoPermissionsrdquo ldquoImportant Noticesrdquo or others

52 The command was wget mdashm mdash-cookies=on mdash-keep-session-cookies mdash-load-cookies=cookietxt mdash-referrer=httpepelac-bacgcca100205301iccdccensusdefaulthtm mdashrandom-wait mdashwait=1 mdashlimit-rate=30K httpepelac-bacgcca100205301iccdcEAlphabetasp This loadeda previous cookie load as there is a ldquosplash redirectrdquo page warning aboutdysfunctional archived content It also imposed a random wait andbandwidth cap to be a good netizen

53 For a discussion of this see Christopher D Manning PrabhakarRaghavan Hinrich Scuumltze An Introduction to Information Retrieval(Cambridge UK Cambridge University Press 2009) 27ndash8 Availableonline at wwwnlpstanfordeduIR-book

54 David M Blei ldquoIntroduction to Probabilistic Topic Modelsrdquowwwcsprincetonedu~bleipapersBlei2011pdf ltviewed 18 June2012gt

55 The research was actually carried out on a rather dated late 2007MacBook Pro

56 See John Bonnett ldquoHigh-Performance Computing An Agenda for theSocial Sciences and the Humanities in Canadardquo Digital Studies 1 no 2(2009) wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt and John Bonnett Geoffrey Rockwell and KyleKuchmey ldquoHigh Performance Computing in the arts and HumanitiesrdquoSHARCNET Shared Hierarchical Academic Research Computing Network2006 wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt

57 ldquoSHARCNET History of SHARCNETrdquo SHARCNET 2009

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

wwwsharcnetcamyabouthistory ltviewed 18 December 2012gt 58 ldquoPriority Areasrdquo SSHRC last modified 4 November 2011

wwwsshrc-crshgccafunding-financementprograms-programmespriority_areas-domaines_prioritairesindex-engaspx ltviewed 18 June2012gt

59 Eloquently discussed in Adam Crymble ldquoIs Digital Humanities a Fieldof Researchrdquo Thoughts on Digital amp Public History 11 August 2011wwwadamcrymbleblogspotcom201108is-digital-humanities-field-of-researchhtml ltviewed 18 June 2012gt

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 19: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

Canada just in time for the website to completely disappear by July 2007. In the archived version, subject indexes and headings were gone. All that remained then, and today for that matter, is a straightforward alphabetical list and a disclaimer that noted "[y]ou are viewing a document archived by Library and Archives Canada. Please note, information may be out of date and some functionality lost."

The question now is what we can do with this collection of websites. Could a distant reading, as opposed to a close reading on a targeted level, tell us something about this collection? Fortunately, we have a variety of tools that can help us navigate this large array of information, several of which I use to navigate large amounts of data. First, I create my own tools through computer programming; this work is showcased in the following section. I program in Wolfram Research's Mathematica 9. Mathematica is an integrated platform for technical computing that allows users to process, visualize, and interact with exceptional arrays of information.45 It is a language with an engaged and generous user base that complements exhaustive, well-written documentation from Wolfram Research. To emphasize its ease, one line of code can import a website in perfectly readable plaintext (avoiding Unicode errors), as follows: input = Import["http://activehistory.ca", "Plaintext"], and a second can convert it all to lower case for textual analysis, as such: lowerpage = ToLowerCase[input]. Variables are a critical building block. In the former example, "input" now holds the entire plaintext of the website ActiveHistory.ca, and "lowerpage" then holds it in lower case. Once we have lower-cased information, we are quickly able to extract frequent words and frequent phrases, and could thus create a dynamic database of n-grams, word clouds, or run other textual analysis programs on the material. Much of the syntax adheres to a recognizable English-language format.
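To give a sense of how little scaffolding this requires, the following is a minimal sketch that continues from the two lines just quoted; the variable names match those above, while the short stop-word list is only an illustrative stub, not the list used in this study:

    words = StringSplit[lowerpage, RegularExpression["[^a-z]+"]];  (* tokenize on anything that is not a letter *)
    stopwords = {"the", "and", "to", "of", "a", "in", "is", "for", "on"};
    counts = Tally[Select[words, StringLength[#] > 1 && !MemberQ[stopwords, #] &]];
    Take[Reverse[SortBy[counts, Last]], 20]  (* the twenty most frequent remaining words *)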

Even more promising is Mathematica's integration with the Wolfram|Alpha database (itself accessible at wolframalpha.com), a computational knowledge engine that promises to put incredible amounts of information at a user's fingertips. With a quick command, one can access the temperature at Toronto's Lester B. Pearson International Airport at 5:00 p.m. on 18 December 1983, the relative popularity of names such as "Edith" or "Emma," the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information or find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.
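Inside a notebook, such lookups are one-liners; a sketch (the query phrasings here are illustrative and may need adjusting before Wolfram|Alpha interprets them as intended):

    WolframAlpha["Canada unemployment rate March 1985"]
    WolframAlpha["temperature Toronto Pearson airport 5pm 18 December 1983"]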

For many researchers, the downside of Mathematica is a considerable one: it is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be all and end all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools.47 It is a powerful suite of textual analysis tools. Uploading one's own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a "word cloud" of a corpus; the ability to click on any word to see its relative rise and fall over time (word trends); access to the original text; and the ability to click on any word to see its contextual placement within individual sentences. It ably facilitates the "distance" reading of any reasonably-sized body of textual information and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website, hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. There are many visualization options provided: you can plot relationships between data, compare values (i.e. bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or, in a best case scenario, hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" that appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpora that were previously too big to extract meaningful information from can now be explored relatively quickly.

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2 or on my own blog at ianmilligan.ca. I have elected to leave them online rather than provide commands in this article: they require some technical knowledge, and working directly from the web is quicker than doing so here. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian dealing with conventional sources often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command line entry brought the entire archive onto my system; it took days to process the information and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay, or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations. The information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, with 360 MB of plaintext with HTML encoding, reduced to a 'mere' 50 MB if all encoding is removed. A 50 MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the Internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with material. Before you do this, though, you need to be familiar with the website's "terms of service," the legal documentation that outlines how users are allowed to use a website, its resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced in part or in whole and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so: the index is located at www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/AB/benevolat, the next in .../cdc/ab/nature, and so on. Using wget, a free software program, you can download all websites in a certain structure; the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system, in this case one that noted that you had seen the disclaimer about the archival nature of the website) and then, with a one-line command in my command line, to download the entire database.52 Due to the imposition of a random wait and bandwidth limit, which meant that I would download one file every half to one and a half seconds at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these must often be inferred online. HTML born-digital sources are different in that the link offers a very concrete file.

Visualizing the CDC Collection as a Series of Hyperlinks

One quick visualization, the circular cluster seen above, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As we see in the figure, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The site is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and the relationships between different sources. At a glimpse, we can see suggestions to follow other sites and determine if any nodes are shared amongst the various websites revealed.
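A rough sense of how such a model can be assembled: Import can return the hyperlinks on a page directly, and each link becomes a directed edge in a graph. A minimal sketch, using ActiveHistory.ca rather than a CDC page, and following only one level of links:

    url = "http://activehistory.ca/";
    links = Import[url, "Hyperlinks"];  (* every outbound link on the page *)
    Graph[Map[(url -> #) &, links], VertexLabels -> Placed["Name", Tooltip]]

Repeating the scrape for each page discovered, two levels deep, yields the hub-and-strand cluster described above.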

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian who is primarily interested in the written word, my first step was to isolate the plaintext throughout the material so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels: first, the aggregate collection, and second, dividing the collection into each 'file,' or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words within the nearly eight million words of the collection appear? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader and is of minimal visual attractiveness. Two quick decisions were made to enhance data usability: first, the information was "case-folded," all sent to lower-case; this enables us to count the total appearances of a word such as "digital" regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency, as opposed to phrase frequency, "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is comprised entirely of stop words), but will quickly dominate any word-level visualization or list.53 Other words may need to be added to or removed from stop lists depending on the corpus; for example, if the phrase "page number" appeared on every page, that would be useless information for the researcher.

The first step, then, is to use Mathematica to format word frequency information into Wordle.net-readable format (if we put all of the text into that website, it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds.

A Word Cloud of All Words in the CDC Corpus

What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that it is dealing with Canada/Canadians, that it is a collection with documents in French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, a word cloud leaves a severe amount of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but need to be used with extreme caution. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present." Despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters (such as 'bi'), two syllables, or, pertinent in our case, two words; an example is "canada is." A trigram is three words, a quadgram is four words, a fivegram is five words, and so on.
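Extracting these from a tokenized text is nearly a one-liner per length; a sketch, reusing the "words" list from the earlier example:

    bigrams = Partition[words, 2, 1];  (* overlapping windows of two words *)
    trigrams = Partition[words, 3, 1];
    Take[Reverse[SortBy[Tally[trigrams], Last]], 10]  (* the ten most common trigrams *)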


Word Cloud of the File "Past and Present"

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

"alberta", 758; "listen", 490; "settlers", 467; "new", 454; "canada", 388; "land", 362; "read", 328; "people", 295; "settlement", 276; "chinese", 268; "canadian", 264; "place", 255; "heritage", 254; "edmonton", 250; "years", 238; "west", 236; "raymond", 218; "ukrainian", 216; "came", 213; "calgary", 209; "names", 206; "ranch", 204; "updated", 193; "history", 190; "communities", 183 ...

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example, in this automatically generated output of top phrases:

"first" "people" "settlers", 100; "the" "united" "states", 72; "one" "of" "the", 59; "of" "the" "west", 58; "settlers" "last" "updated", 56; "people" "settlers" "last", 56; "albertans" "first" "people", 56; "adventurous" "albertans" "first", 56; "the" "bar" "u", 55; "opening" "of" "the", 54; "communities" "adventurous" "albertans", 54; "new" "communities" "adventurous", 54; "bar" "u" "ranch", 53; "listen" "to" "the", 52 ...

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to quickly hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.
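Mathematica's Manipulate makes this sort of interactive viewer inexpensive to build; a sketch of the idea, again assuming a tokenized "words" list:

    Manipulate[
     Take[Reverse[SortBy[Tally[Partition[words, n, 1]], Last]], k],
     {n, 2, 4, 1}, {k, 5, 30, 5}]  (* sliders for n-gram length and list size *)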

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stop-word" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

"experience" "in" "the", 919; "transferred" "to" "the", 918; "funded" "by" "the", 918; "of" "the" "provincial", 915; "this" "web" "site", 915; "409" "registry" "2", 914; "are" "no" "longer", 914; "gouvernement" "du" "canada", 910; "use" "this" "link", 887; "page" "or" "you", 887; "following" "page" "or", 887; "automatically" "transferred" "to", 887; "be" "automatically" "transferred", 887; "will" "be" "automatically", 887; "industry" "you" "will", 887; "multimedia" "industry" "you", 887; "practical" "experience" "in", 887; "gained" "practical" "experience", 887; "they" "gained" "practical", 887; "while" "they" "gained", 887; "canadians" "while" "they", 887; "young" "canadians" "while", 887; "showcase" "of" "work", 887; "excellent" "showcase" "of", 887 ...

Here we see an array of common language, funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we can move into the unique or very rare phrases that appear only once, twice, or three times.
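Both counting approaches take only a few lines apiece; a sketch, assuming "siteWords" holds one list of words per website:

    aggregateCount[term_] := Total[Count[#, term] & /@ siteWords]
    relativeFreq[term_] := Count[siteWords, _?(MemberQ[#, term] &)]/N[Length[siteWords]]

The first simply sums appearances across the collection; the second counts the share of sites in which a term appears at least once, which is what controls for the 'chorus' effect described above.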

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time ... Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable, and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works: it is a proprietary black box. While the original algorithm is available, the actual methods that go into determining the ranking of search terms are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for the occurrence of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above, we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
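The machinery behind such a chart is modest; a sketch, assuming "freqs" holds per-site tallies as {sitename, {{word, count}, ...}} pairs:

    countFor[term_, {site_, tally_}] := Total[Cases[tally, {term, n_} :> n]]
    hits = {#[[1]], countFor["freedom", #]} & /@ freqs;
    BarChart[hits[[All, 2]], ChartLabels -> Placed[hits[[All, 1]], Tooltip]]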

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.

A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly realize critical sources.
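A keyword-in-context routine of this sort reduces to finding the position of each match and slicing a fixed window of characters around it; a minimal sketch:

    kwic[text_, term_, window_: 40] :=
     StringTake[text, {Max[1, First[#] - window],
         Min[StringLength[text], Last[#] + window]}] & /@
      StringPosition[text, term]
    kwic[lowerpage, "ontario"]  (* each "ontario", with 40 characters of context on either side *)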

Another model through which we can visualize documents is through a viewer primarily focusing on term document frequency and inverse frequency terms, discussed below. To prepare this material, we are combining our earlier data about word frequency (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e. the word "experience" appears 1,766 times). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf times the logarithm of the number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."

At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.
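The formula itself is compact; a sketch, with the document frequencies below chosen purely for illustration (the CDC corpus holds 384 sites):

    tfidf[tf_, df_, ndocs_] := tf*Log[ndocs/df]
    tfidf[919, 300., 384]  (* a word common to most sites, such as "experience", scores low *)
    tfidf[204, 12., 384]   (* a word rare across sites, such as "ranch", scores high *)

A term that appears in nearly every website is weighted toward zero however frequent it is locally, which is exactly what allows distinctive words to float to the top of a site's profile.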

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5 MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance, rather than processed on-site, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need rears itself, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin, or jumpstart, a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. Such courses should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before. How have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distance reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web, whether through blogging, micro-blogging, publishing webpages, and so forth: it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history and the readiness to work with born-digital sources cannot stem out of pedagogy alone, however: institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article has argued, first, why digital methodologies are necessary, and second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1. The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2. A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3. See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4. These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5. Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6. This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7. Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8. While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9. A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10. Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11. Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12. Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13. For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.

14. "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15. For an overview, see "Award Recipients - 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16. One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to learn the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17. Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18. Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19. Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp, Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20. The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age (Google eBook: Penguin, 2010). For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21. Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online, www.mashable.com/2010/03/09/cisco-crs-3.

22. A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes, a megabyte (MB) a million, a gigabyte (GB) a billion, a terabyte (TB) a trillion, and a petabyte (PB) a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23. These figures can of course vary depending on compression methods chosen. Their math was based on an average of 8 MB per book, if it was scanned at 600 dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24. Gleick, 232.

25. "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26. "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27. Jean-Baptiste Michel et al., 176–82; published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28. "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29. For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012> and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012> and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30. Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31. "Save LAC May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32. The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Centre for History and New Media and American Social History Project / Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Centre for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33. Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34. For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35. Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, www.webarchive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36. Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37. Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38. See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39. Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40. "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online, www.waybackarchive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41. "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online, www.webarchive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42. Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collection Webpage, 24 November 2000, through Internet Archive, www.webarchive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43. "Canada's Digital Collections homepage," 21 November 2003, through Internet Archive, www.webarchive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44. "Important news about Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive, www.webarchive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45. For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46. It was 8 degrees Celsius at 5:00 p.m. on 18 December 1983; "Edith" was more popular than Emma until around 1910, and now "Emma" is the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47. It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.

48. For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49. The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50. For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51. "Important Notices," Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will have to increasingly learn to navigate the Terms of Service (TOS) of born-digital source repositories. They can usually be found on index or splash pages under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52. The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previous cookie load, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and bandwidth cap to be a good netizen.

53. For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54. David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55. The research was actually carried out on a rather dated late-2007 MacBook Pro.

56. See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012> and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57. "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58. "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59. Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.

20 The quotations by Luther and Poe are commonly used to make thispoint For the best example see Clay Shirky Cognitive SurplusCreativity and Generosity in a Connected Age Google eBook Penguin2010 For Mumford see Lewis Mumford The Myth of the Machine vol2 The Pentagon of Power (New York Harcourt Brace 1970) 182 asquoted in Gleick The Information A History A Theory A Flood (NewYork Pantheon 2011) 404

21 Adam Ostrow ldquoCisco CRS-3 Download Library of Congress in JustOver One Secondrdquo Mashable (9 March 2010) accessed onlinewwwmashablecom20100309cisco-crs-3

22 A note on data size is necessary The basic unit of information is a ldquobyterdquo (8 bits) which is roughly a single character A kilobyte (KB) is athousand bytes a megabyte (MB) a million a gigabyte (GB) a billion aterabyte (TB) a trillion and a petabyte (PB) a quadrillion Beyond thisthere are exabytes (EB) zettabytes (ZB) and yottabytes (YB)

23 These figures can of course vary depending on compression methodschosen Their math was based on an average of 8MB per book if it wasscanned at 600dpi and then compressed There are 26 million books inthe Library of Congress For more see Mike Ashenfelder ldquoTransferringlsquoLibraries of Congressrsquo of Datardquo Library of Congress Digital Preservation

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

Blog 11 July 2011wwwblogslocgovdigitalpreservation201107transferring-libraries-of-congress-of-data ltviewed 11 January 2013gt

24 Gleick 232 25 ldquoAbout the Library of Congress Web Archivesrdquo Library of Congress Web

Archiving FAQs Undated wwwlocgovwebarchivingfaqhtmlfaqs_02ltviewed 11 January 2013gt

26 ldquo10000000000000000 Bytes Archivedrdquo Internet Archive Blog 26October 2012 wwwblogarchiveorg2012102610000000000000000-bytes-archivedltviewed 19 December 2012gt

27 Jean-Baptiste Michel et al 176ndash82 Published Online Ahead of Print16 December 2010 wwwsciencemagorgcontent3316014176ltviewed 29 July 2011gt See also Scott Cleland ldquoGooglersquos PiracyLiabilityrdquo Forbes 9 November 2011 wwwforbescomsitesscottcle-land20111109googles-piracy-liability ltviewed 11 January 2013gt

28 ldquoGovernment of Canada Web Archiverdquo Library and Archives Canada 17October 2007 wwwcollectionscanadagccawebarchivesindex-ehtmlltviewed 11 January 2013gt

29 For background see Bill Curry ldquoVisiting Library and Archives inOttawa Not Without an Appointmentrdquo Globe and Mail 1 May 2012wwwtheglobeandmailcomnewspoliticsottawa-notebookvisiting-library-and-archives-in-ottawa-not-without-an-appointmentarticle2418960 ltviewed 22 May 2012gt and the announcement ldquoLAC Begins Implementation of New Approach to service Deliveryrdquo Library and Archives Canada February 2012wwwcollectionscanadagccawhats-new013-560-ehtml ltviewed 22May 2012gt For professional responses see the ldquoCHArsquos comment onNew Directions for Library and Archives Canadardquo CHA Website March2012 wwwcha-shccaenNews_39items19html ltviewed 19December 2012gt and ldquoSave Library amp Archives Canadardquo CanadianAssociation of University Teachers launched AprilMay 2012 wwwsavelibraryarchivesca ltviewed 19 December 2012gt

30 Ian Milligan ldquoThe Smokescreen of lsquoModernizationrsquo at Library andArchives Canadardquo ActiveHistoryca 22 May 2012wwwactivehistoryca201205the-smokescreen-of-modernization-at-library-and-archives-canada ltviewed 19 December 2012gt

31 ldquoSave LAC May 2012 Campaign Updaterdquo SaveLibraryArchivesca May2012 wwwsavelibraryarchivescaupdate-2012-05aspx ltviewed 19December 2012gt

32 The challenges and opportunities presented by such projects is extensively discussed in Daniel J Cohen ldquoThe Future of Preserving the Pastrdquo CRM The Journal of Heritage Stewardship 2 no 2

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

(Summer 2005) 6ndash19 also available freely online at the RoyRosenzweig Center for History and New Mediawwwchnmgmueduessays-on-history-new-mediaessaysessayid=39For more see ldquoSeptember 11 Digital Archive Saving the Histories of September 11 2001rdquo Centre for History and New Media andAmerican Social History Project Center for Media and Learningwww911digitalarchiveorg ltviewed 11 January 2013gt ldquoHurricaneDigital Memory Bank Collecting and Preserving the Stories of Katrina and Ritardquo Center for History and New Media wwwhurricanearchiveorg ltviewed 11 January 2013gt and ldquoOccupyArchive Archiving the Occupy Movements from 2011rdquo Centre forHistory and New Media wwwoccupyarchiveorg ltviewed 11 January2013gt

33 Daniel Cohen and Roy Rosenzweig Digital History A Guide to Gathering Preserving and Presenting the Past on the Web(Philadelphia Temple University Press 2005) chapter on ldquoPreservingDigital Historyrdquo The chapter is available fully online atwwwchnmgmuedudigitalhistorypreserving5php ltviewed 19December 2012gt

34 For an overview see ldquoPreserving Our Digital Heritage The NationalDigital Information Infrastructure and Preservation Program 2010 Report a Collaborative Initiative of the Library of CongressrdquoDigitalpreservationgov wwwdigitalpreservationgovmultimediadocumentsNDIIPP2010Report_Postpdf ltviewed 19 December 2012gt

35 Steve Meloan ldquoNo Way to Run a Culturerdquo Wired 13 February 1998available online via the Internet Archivewwwwebarchiveorgweb20000619001705httpwwwwiredcomnewsculture012841030100html ltviewed 19 December 2012gt

36 Some of this has been discussed in greater depth on my website See theseries beginning with Ian Milligan ldquoWARC Files A Challenge forHistorians and Finding Needles in Haystacksrdquo ianmilliganca 12December 2012 wwwianmilliganca20121212warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks ltviewed 20 December2012gt

37 Unstructured data as opposed to structured data is important for histo-rians but brings additional challenges Some projects have producedstructured data novels marked up using XML encoding for example soone can immediately see place names dates character names and soforth This makes it much easier for a computer to read and understand

38 See William Noblett ldquoDigitization A Cautionary Talerdquo New Review ofAcademic Librarianship 17 (2011) 3 Tobias Blanke Michael Bryantand Mark Hedges ldquoOcropodium Open Source OCR for Small-ScaleHistorical Archivesrdquo Journal of Information Science 38 no 1 (2012) 83

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

and Donald S Macqueen ldquoDeveloping Methods for Very-Large-ScaleSearches in Proquest Historical Newspapers Collections and InfotracThe Times Digital Archive The Case of Two Million versus TwoMillionsrdquo Journal of English Linguistics 32 no 2 (June 2004) 127

39 Elizabeth Krug ldquoCanadarsquos Digital Collections Sharing the CanadianIdentity on the Internetrdquo The Archivist [Library and Archives Canadapublication] April 2000 wwwcollectionscanadagccapublicationsarchivistmagazine015002-2170-ehtml ltviewed 11 January 2013gt

40 ldquoWayBackMachine Beta wwwcollectionsicgccardquo Internet ArchiveAvailable onlinewwwwaybackarchiveorgwebhttpcollectionsicgcca ltviewed 11January 2013gt

41 ldquoStreet Youth Go Onlinerdquo News Release from Industry Canada InternetArchive 15 November 1996 Available onlinewwwwebarchiveorgweb20001011131150httpinfoicgccacmbWelcomeicnsfffc979db07de58e6852564e400603639514a8d746a34803385256612004d90b3OpenDocument ltviewed 11 January 2013gt

42 Darlene Fichter ldquoProjects for Libraries and Archivesrdquo Canadarsquos DigitalCollection Webpage 24 November 2000 through Internet Archivewwwwebarchiveorgweb200012091312httpcollectionsicgccaEcdc_projhtm ltviewed 18 June 2012gt

43 ldquoCanadarsquos Digital Collections homepagerdquo 21 November 2003 through Internet Archive wwwwebarchiveorgweb20031121213939httpcollectionsicgccaEindexphpvo=15 ltviewed 18 June 2012gt

44 ldquoImportant news about Canadarsquos Digital Collectionsrdquo Canadarsquos DigitalCollections Webpage 18 November 2006gt through Internet Archivewwwwebarchiveorgweb20061118125255httpcollectionsicgccaEviewhtml ltviewed 18 June 2012gt

45 For more information see the information at ldquoWolfram Mathematica9rdquo Wolfram Research wwwwolframcommathematica ltviewed 11January 2013gt An example of work in Mathematica can be seen inWilliam J Turkelrsquos demonstration of ldquoTerm Weighting with TF-IDFrdquodrawing on Old Bailey proceedings It is available at wwwdemonstrationswolframcomTermWeightingWithTFIDFltviewed 11 January 2013gt

46 It was 8 degrees Celsius at 500 pm on 18 December 1983 ldquoEdithrdquowas more popular than Emma until around 1910 and now ldquoEmmardquo isthe third most popular name in the United States for new births in2010 the Canadian unemployment rate in March 1985 was 106 per-cent 25th highest in the world and the Canadian GDP in December1995 was $6039 billion per year 10th largest in the world

47 It was formerly known as ldquoVoyeur Toolsrdquo before a 2011 name change Itis available online at wwwvoyant-toolsorg

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

48 For more information see ldquoSoftware Environment for the Advancementof Scholarly Research (SEASR)rdquo wwwseasrorg ltviewed 18 June2012gt

49 The MALLET toolkit can also be run independently of SEASR For more on topic modeling specifically see ldquoTopic Modelingrdquo MALLET Machine Learning for Language Toolkitwwwmalletcsumassedutopicsphp ltviewed 18 June 2012gt

50 For a good overview see ldquoAnalytics Dunningrsquos Log-Likelihood StatisticrdquoMONK Metadata Offer New Knowledge wwwgautamlisillinoisedumonkmiddlewarepublicanalyticsdunningshtml ltviewed 18 June2012gt

51 ldquoImportant Noticesrdquo Library and Archives Website last updated 30 July2010 wwwcollectionscanadagccanoticesindex-ehtml ltviewed 18June 2012gt Historians will have to increasingly learn to navigate theTerms of Services (TOS) of born-digital source repositories They can beusually found on index or splash pages under headings as varied asldquoTerms of Servicerdquo ldquoPermissionsrdquo ldquoImportant Noticesrdquo or others

52 The command was wget mdashm mdash-cookies=on mdash-keep-session-cookies mdash-load-cookies=cookietxt mdash-referrer=httpepelac-bacgcca100205301iccdccensusdefaulthtm mdashrandom-wait mdashwait=1 mdashlimit-rate=30K httpepelac-bacgcca100205301iccdcEAlphabetasp This loadeda previous cookie load as there is a ldquosplash redirectrdquo page warning aboutdysfunctional archived content It also imposed a random wait andbandwidth cap to be a good netizen

53 For a discussion of this see Christopher D Manning PrabhakarRaghavan Hinrich Scuumltze An Introduction to Information Retrieval(Cambridge UK Cambridge University Press 2009) 27ndash8 Availableonline at wwwnlpstanfordeduIR-book

54 David M Blei ldquoIntroduction to Probabilistic Topic Modelsrdquowwwcsprincetonedu~bleipapersBlei2011pdf ltviewed 18 June2012gt

55 The research was actually carried out on a rather dated late 2007MacBook Pro

56 See John Bonnett ldquoHigh-Performance Computing An Agenda for theSocial Sciences and the Humanities in Canadardquo Digital Studies 1 no 2(2009) wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt and John Bonnett Geoffrey Rockwell and KyleKuchmey ldquoHigh Performance Computing in the arts and HumanitiesrdquoSHARCNET Shared Hierarchical Academic Research Computing Network2006 wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt

57 ldquoSHARCNET History of SHARCNETrdquo SHARCNET 2009

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

wwwsharcnetcamyabouthistory ltviewed 18 December 2012gt 58 ldquoPriority Areasrdquo SSHRC last modified 4 November 2011

wwwsshrc-crshgccafunding-financementprograms-programmespriority_areas-domaines_prioritairesindex-engaspx ltviewed 18 June2012gt

59 Eloquently discussed in Adam Crymble ldquoIs Digital Humanities a Fieldof Researchrdquo Thoughts on Digital amp Public History 11 August 2011wwwadamcrymbleblogspotcom201108is-digital-humanities-field-of-researchhtml ltviewed 18 June 2012gt

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 20: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

tive popularity of names such as “Edith” or “Emma,” the Canadian unemployment rate in March 1985, or the Canadian Gross Domestic Product in December 1995.46 This allows one to quickly cross-reference findings with statistical information or find coincidental points of comparison. As universities are now being actively encouraged to contribute their information to Wolfram|Alpha, this will become an increasingly powerful database.
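Mathematica exposes Wolfram|Alpha directly, which makes this sort of cross-referencing scriptable. A minimal sketch (the free-form query strings are illustrative, drawn from the examples in note 46; results depend on what Wolfram|Alpha curates):

    (* ask Wolfram|Alpha from within Mathematica (version 8 and later) *)
    WolframAlpha["Canadian unemployment rate March 1985"]
    (* request a single computable value rather than the full report *)
    WolframAlpha["Canadian GDP December 1995", "Result"]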

For many researchers, the downside of Mathematica is a considerable one. It is proprietary software, and relatively expensive for student and non-student alike. Outputs, however, can be disseminated using the Computational Document Format (CDF), enabling historians to put their work online and entice the general public to dynamically work with their models and findings.

Programming is not the be all and end all. Information visualization in the humanities is a growing international field, with Canadian scholars heavily involved. The most important and accessible example is Voyant Tools.47 It is a powerful suite of textual analysis tools. Uploading one’s own text or collection of texts, drawing on a webpage, or analyzing existing bodies of work, one sees the following: a “word cloud” of a corpus; the relative rise and fall of any word over time (word trends); the original text; and, on clicking any word, its contextual placement within individual sentences. It ably facilitates the “distance” reading of any reasonably-sized body of textual information and should be a beginning point for those interested in this field. The only downside is that Voyant can choke on very large amounts of information, as it is a textual analysis tool rather than a robust data mining tool.

There are several other complementary tools. A notable one is the Many Eyes website hosted by IBM Research. Many Eyes allows you to upload datasets, be they textual, tabular, or numerical. There are many visualization options provided: you can plot relationships between data, compare values (i.e., bar charts and bubble charts), track data over time, and apply information to maps. The only downside to Many Eyes is that you must register and make your datasets available to all other registered researchers, as well as IBM. This limits your ability to use copyrighted or proprietary information.

More advanced technical users can avail themselves of a host of textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or, in a best case scenario, hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into “topics” that appear together, giving a quick sense of a document’s structure.49 Other workflows found within SEASR include Dunning’s log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find “people,” “places,” and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.
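For readers who want the statistic itself rather than the workflow, here is a minimal sketch of Dunning’s log-likelihood in Mathematica; the function name and the sample counts are mine, not SEASR’s:

    (* G2 for one word: a = its count in the target document, b = its count
       in the reference corpus; c and d are the total words in each.
       Expected counts assume the word is equally common in both. *)
    dunningG2[a_, b_, c_, d_] :=
     Module[{e1 = c (a + b)/(c + d), e2 = d (a + b)/(c + d)},
      2. (If[a > 0, a Log[a/e1], 0] + If[b > 0, b Log[b/e2], 0])]

    dunningG2[120, 40, 50000, 2000000]

The higher the score, the more strongly a word distinguishes the target document (say, the 1968 lyrics) from the reference corpus.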

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2 or on my own blog at ianmilliganca. I have elected to leave them online rather than provide full commands in this article: they require some technical knowledge, and working directly from the web is quicker than working from the printed page. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today’s historian dealing with conventional sources often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed. Our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada’s Digital Collections, a single command line entry brought the entire archive onto my system; it took days to process the information, and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext without any HTML markup is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations. The information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, with 360 MB of plaintext with HTML encoding, reduced to a ‘mere’ 50 MB if all encoding is removed. A 50 MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with material. Before you do this, though, you need to be familiar with the website’s “terms of service,” the legal documentation that outlines how users are allowed to use a website, resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information “may be reproduced, in part or in whole and by any means, without charge or further permission, unless otherwise specified.”51

The CDC web archive index is organized like so: the index is located at www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in www.epe.lac-bac.gc.ca/100/205/301/ic/cdc/ABbenevolat, the next in …/cdc/abnature, and so on. Using wget, a free software program, you can download all websites in a certain structure; the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system, in this case one noting that you had seen the disclaimer about the archival nature of the website) and then, with a one-line command, downloaded the entire database.52 Due to the imposition of a random wait and bandwidth limit, which meant that I would download one file every half to one and a half seconds at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.
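The full command, reconstructed from the garbled text of note 52 (URL paths as given there, reformatted across lines for legibility), was:

    wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt \
      --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm \
      --random-wait --wait=1 --limit-rate=30K \
      http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp

The cookie options replay the “splash page” disclaimer cookie, while --random-wait, --wait, and --limit-rate impose the politeness limits described above.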

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these connections must often be inferred; HTML born-digital sources are different in that the link offers a very concrete file.

[Figure: Visualizing the CDC Collection as a Series of Hyperlinks]

One quick visualization, the circular cluster seen above, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As the figure shows, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The site is limited to following two links beyond the first page. Through this approach, we see the interconnection of knowledge and relationships between different sources. At a glimpse, we can see suggestions to follow other sites and determine if any nodes are shared amongst the various websites revealed.
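A rough sketch of how such a link graph can be assembled in Mathematica, assuming the mirrored collection sits in a local folder (the path is illustrative); Import’s “Hyperlinks” element does the scraping:

    (* collect every saved page, scrape its outbound links, and build
       page -> link rules; GraphPlot renders them with mouse-over labels *)
    pages = FileNames["*.html", "cdc-mirror", Infinity];
    edges = Flatten[Table[Thread[p -> Import[p, "Hyperlinks"]], {p, pages}]];
    GraphPlot[edges, VertexLabeling -> Tooltip, DirectedEdges -> True]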

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada’s Digital Collections corpus? As a historian who is primarily interested in the written word, my first step was to isolate the plaintext throughout the material so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels: first, the aggregate collection, and second, dividing the collection into each ‘file,’ or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words within the nearly eight million words of the collection appear? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word “Canada” appears 26,429 times, “collections” 11,436, and “[L]ouisbourg” an astounding 9,365 times. Reading such a list can quickly wear out a reader and is of minimal visual attractiveness. Two quick decisions were made to enhance data usability: first, the information was “case-folded,” all sent to lower-case; this enables us to count the total appearances of a word such as “digital” regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency, as opposed to phrase frequency, “stop words” should be removed, as they were here. These are “extremely common words which would appear to be of little value in helping select documents matching a user need”: words such as “the,” “and,” “to,” and so forth. They matter for phrases (the example “to be or not to be” is comprised entirely of stop words), but will quickly dominate any word-level visualization or list.53 Other words may need to be added to or removed from stop lists depending on the corpus; for example, if the phrase “page number” appeared on every page, that would be useless information for the researcher.
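Both decisions reduce to a few lines of Mathematica. A minimal sketch, assuming the collection’s plaintext has been concatenated into a single file (the file name and the deliberately tiny stop list are mine):

    (* case-fold, split on non-word characters, drop stop words, tally *)
    text = ToLowerCase[Import["cdc-plaintext.txt", "Text"]];
    words = DeleteCases[StringSplit[text, Except[WordCharacter] ..], ""];
    stop = {"the", "and", "to", "of", "a", "in", "de", "la", "les"};
    freq = Reverse[SortBy[Tally[Select[words, ! MemberQ[stop, #] &]], Last]];
    Take[freq, 25]  (* the most frequent substantive words *)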

The first step, then, is to use Mathematica to format word frequency information into Wordle.net readable format (if we put all of the text into that website, it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds. What can we learn from this illustration?

[Figure: A Word Cloud of All Words in the CDC Corpus]

First, without knowing anything about the sites, we can immediately learn that the collection deals with Canada/Canadians, that it is a collection with documents in French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that “ontario” is less frequent than “newfoundland,” “nova scotia,” and “alberta,” although “québec” and “quebec” are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, a word cloud leaves a severe amount of ambiguity: are “ontario” and “québec” equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word “school” refer to built structures or educational experiences? The same question applies to “church.” Do mentions of “newfoundland” refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but they need to be used carefully. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website “past and present.” Despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters (such as ‘bi’), two syllables, or, pertinent in our case, two words; an example: “canada is.” A trigram is three words, a quadgram is four words, a fivegram is five words, and so on.
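Computationally, an n-gram is just an overlapping window over the word list, which Mathematica’s Partition produces directly (an offset of 1 keeps every starting position). A sketch, reusing the unfiltered words list from the earlier snippet, since stop words must stay in for phrases:

    ngrams[ws_List, n_Integer] := Tally[Partition[ws, n, 1]];
    trigrams = Reverse[SortBy[ngrams[words, 3], Last]];
    Take[trigrams, 10]  (* the ten most common three-word phrases *)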


[Figure: Word Cloud of the File “Past and Present”]

[Figure: A viewer for n-grams in the CDC Collection]

The CDC collection demonstrates this approach’s utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named “past and present” website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

“alberta” 758, “listen” 490, “settlers” 467, “new” 454, “canada” 388, “land” 362, “read” 328, “people” 295, “settlement” 276, “chinese” 268, “canadian” 264, “place” 255, “heritage” 254, “edmonton” 250, “years” 238, “west” 236, “raymond” 218, “ukrainian” 216, “came” 213, “calgary” 209, “names” 206, “ranch” 204, “updated” 193, “history” 190, “communities” 183 …

Rather than having to visit the website, we can almost immediately see at a glance what the site is about. More topics become apparent when one looks at trigrams, as in this automatically generated output of top phrases:

“first people settlers” 100, “the united states” 72, “one of the” 59, “of the west” 58, “settlers last updated” 56, “people settlers last” 56, “albertans first people” 56, “adventurous albertans first” 56, “the bar u” 55, “opening of the” 54, “communities adventurous albertans” 54, “new communities adventurous” 54, “bar u ranch” 53, “listen to the” 52 …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if “aboriginal” appears 457 times, then n = 457); or, second, relative appearances of a given term (if “aboriginal” appears in 45 of 384 sites, then n = 45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the ‘chorus’ effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common ‘stop-word’ equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

“experience in the” 919, “transferred to the” 918, “funded by the” 918, “of the provincial” 915, “this web site” 915, “409 registry 2” 914, “are no longer” 914, “gouvernement du canada” 910, “use this link” 887, “page or you” 887, “following page or” 887, “automatically transferred to” 887, “be automatically transferred” 887, “will be automatically” 887, “industry you will” 887, “multimedia industry you” 887, “practical experience in” 887, “gained practical experience” 887, “they gained practical” 887, “while they gained” 887, “canadians while they” 887, “young canadians while” 887, “showcase of work” 887, “excellent showcase of” 887 …

Here we see an array of common language: funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites as opposed to unique information. Moving down, we find the more common but less universal: “the history of,” for example, “heritage community foundation,” libraries, and so forth. By the end of the list, we move into the unique or very rare phrases that appear only once, twice, or three times.
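The two weighting schemes described above differ by only a line of code. A sketch, assuming a hypothetical variable siteWordLists holding one word list per site (as could be built with the earlier snippets):

    (* total occurrences of a term versus the share of sites containing it *)
    counts = Map[Count[#, "aboriginal"] &, siteWordLists];
    aggregate = Total[counts];
    relative = N[Count[counts, c_ /; c > 0]/Length[siteWordLists]];

The relative figure controls for a single repetitive site in the same way that document-level counting controls for the ‘chorus’ effect in lyrics.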

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton’s David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 (“past and present”), we can begin to get a sense of the topics this large website discusses.
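MALLET is a command-line tool, and one way to stay inside Mathematica is simply to shell out to it. A sketch, assuming MALLET is unpacked in a mallet/ subdirectory and file 267’s pages sit in sites/267 (both paths are mine); the flags follow MALLET’s documented topic-modeling interface:

    (* build a MALLET corpus from the site's files, stop words removed *)
    Run["mallet/bin/mallet import-dir --input sites/267 --output past.mallet --keep-sequence --remove-stopwords"];
    (* fit ten topics; the key words per topic land in past-keys.txt *)
    Run["mallet/bin/mallet train-topics --input past.mallet --num-topics 10 --output-topic-keys past-keys.txt --output-doc-topics past-topics.txt"];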


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works. It is a proprietary black box: while the original algorithm is available, the actual methods that go into determining the ranking of search terms are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for the occurrence of the words “freedom,” “black,” and “liberty” (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above, we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
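The chart itself is straightforward to reproduce. A sketch, again assuming the hypothetical siteWordLists from above; wrapping the values in Tooltip would supply the mouse-over counts:

    terms = {"freedom", "black", "liberty"};
    hits = Table[Count[w, t], {w, siteWordLists}, {t, terms}];  (* one row per site *)
    BarChart[hits, ChartLegends -> terms]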

We can then move down into the data. In the example above, “047,” labeled “blackgold.txt,” contains 31 occurrences of the word “ontario.” Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.
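A bare-bones keyword-in-context routine needs little more than StringPosition and StringTake. This sketch (the file name is illustrative) pads each hit with forty characters of context on either side:

    kwic[text_String, word_String, pad_: 40] :=
     StringTake[text,
        {Max[1, First[#] - pad],
         Min[StringLength[text], Last[#] + pad]}] & /@
      StringPosition[text, word, IgnoreCase -> True];
    Column[kwic[Import["047-blackgold.txt", "Text"], "ontario"]]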


[Figure: A snapshot of the dynamic finding aid tool]

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, and subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly identify critical sources.

Another model through which we can visualize documents is a viewer primarily focusing on term frequency and inverse document frequency, discussed below. To prepare this material, we combine our earlier data about word frequency in a single file (that the word “experience” appears 919 times, say) with aggregate information about frequency in the entire corpus (i.e., the word “experience” appears 1,766 times overall). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, through which we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf times the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select “inverse document frequency.”
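The formula translates directly into code. A sketch, once more against the hypothetical siteWordLists; the Max guards against a term that appears in no document:

    tf[word_, doc_List] := Count[doc, word];
    df[word_, docs_List] := Count[docs, d_ /; MemberQ[d, word]];
    tfidf[word_, doc_, docs_] := tf[word, doc] Log[Length[docs]/Max[1, df[word, docs]]];
    tfidf["experience", siteWordLists[[267]], siteWordLists]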


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5 MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance rather than processed on the spot, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that can subsequently, if the need arises, be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential of working with born-digital sources. It is intended to begin, or jumpstart, a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information contained across over 73,000 files (as in the CDC collection) can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. They should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software and databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information; this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of the various ways to openly disseminate material on the web, whether through blogging, micro-blogging, or publishing webpages; it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history, and the readiness to work with born-digital sources, cannot stem from pedagogy alone, however. Institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a “digital humanities” option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary and, second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu’un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l’histoire des travailleurs.

Endnotes

1 The question of when the past becomes a “history” to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See “IdleNoMore Timeline,” ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on “Collecting History Online.” Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, “History and the Second Decade of the Web,” Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, “Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day,” ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user’s following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, “This is What a Tweet Looks Like,” ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, “How Computation Changes Research,” in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, “History and the Social Sciences: The Longue Durée,” in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at “Text Analysis,” Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science 331 (14 January 2011): 176–82. The tool is publicly accessible at “Google Books Ngram Viewer,” Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, “Loneliness and Freedom,” Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, “Loneliness and Freedom,” Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.genealogy.umontreal.ca/fr/leprdh.htm.

14 “With Criminal Intent,” CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see “Award Recipients — 2011,” Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to teach the basics of computational history. See The Programming Historian 2, 2012–Present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, “Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective,” Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, “History and Computing,” Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, “Text and Document Visualization,” The Humanities and Technology Camp Greater Toronto Area, Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook (Penguin, 2010). For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow ldquoCisco CRS-3 Download Library of Congress in JustOver One Secondrdquo Mashable (9 March 2010) accessed onlinewwwmashablecom20100309cisco-crs-3

22 A note on data size is necessary The basic unit of information is a ldquobyterdquo (8 bits) which is roughly a single character A kilobyte (KB) is athousand bytes a megabyte (MB) a million a gigabyte (GB) a billion aterabyte (TB) a trillion and a petabyte (PB) a quadrillion Beyond thisthere are exabytes (EB) zettabytes (ZB) and yottabytes (YB)

23 These figures can of course vary depending on compression methodschosen Their math was based on an average of 8MB per book if it wasscanned at 600dpi and then compressed There are 26 million books inthe Library of Congress For more see Mike Ashenfelder ldquoTransferringlsquoLibraries of Congressrsquo of Datardquo Library of Congress Digital Preservation

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

Blog 11 July 2011wwwblogslocgovdigitalpreservation201107transferring-libraries-of-congress-of-data ltviewed 11 January 2013gt

24 Gleick 232 25 ldquoAbout the Library of Congress Web Archivesrdquo Library of Congress Web

Archiving FAQs Undated wwwlocgovwebarchivingfaqhtmlfaqs_02ltviewed 11 January 2013gt

26 ldquo10000000000000000 Bytes Archivedrdquo Internet Archive Blog 26October 2012 wwwblogarchiveorg2012102610000000000000000-bytes-archivedltviewed 19 December 2012gt

27 Jean-Baptiste Michel et al 176ndash82 Published Online Ahead of Print16 December 2010 wwwsciencemagorgcontent3316014176ltviewed 29 July 2011gt See also Scott Cleland ldquoGooglersquos PiracyLiabilityrdquo Forbes 9 November 2011 wwwforbescomsitesscottcle-land20111109googles-piracy-liability ltviewed 11 January 2013gt

28 ldquoGovernment of Canada Web Archiverdquo Library and Archives Canada 17October 2007 wwwcollectionscanadagccawebarchivesindex-ehtmlltviewed 11 January 2013gt

29 For background see Bill Curry ldquoVisiting Library and Archives inOttawa Not Without an Appointmentrdquo Globe and Mail 1 May 2012wwwtheglobeandmailcomnewspoliticsottawa-notebookvisiting-library-and-archives-in-ottawa-not-without-an-appointmentarticle2418960 ltviewed 22 May 2012gt and the announcement ldquoLAC Begins Implementation of New Approach to service Deliveryrdquo Library and Archives Canada February 2012wwwcollectionscanadagccawhats-new013-560-ehtml ltviewed 22May 2012gt For professional responses see the ldquoCHArsquos comment onNew Directions for Library and Archives Canadardquo CHA Website March2012 wwwcha-shccaenNews_39items19html ltviewed 19December 2012gt and ldquoSave Library amp Archives Canadardquo CanadianAssociation of University Teachers launched AprilMay 2012 wwwsavelibraryarchivesca ltviewed 19 December 2012gt

30 Ian Milligan ldquoThe Smokescreen of lsquoModernizationrsquo at Library andArchives Canadardquo ActiveHistoryca 22 May 2012wwwactivehistoryca201205the-smokescreen-of-modernization-at-library-and-archives-canada ltviewed 19 December 2012gt

31 ldquoSave LAC May 2012 Campaign Updaterdquo SaveLibraryArchivesca May2012 wwwsavelibraryarchivescaupdate-2012-05aspx ltviewed 19December 2012gt

32 The challenges and opportunities presented by such projects is extensively discussed in Daniel J Cohen ldquoThe Future of Preserving the Pastrdquo CRM The Journal of Heritage Stewardship 2 no 2

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

(Summer 2005) 6ndash19 also available freely online at the RoyRosenzweig Center for History and New Mediawwwchnmgmueduessays-on-history-new-mediaessaysessayid=39For more see ldquoSeptember 11 Digital Archive Saving the Histories of September 11 2001rdquo Centre for History and New Media andAmerican Social History Project Center for Media and Learningwww911digitalarchiveorg ltviewed 11 January 2013gt ldquoHurricaneDigital Memory Bank Collecting and Preserving the Stories of Katrina and Ritardquo Center for History and New Media wwwhurricanearchiveorg ltviewed 11 January 2013gt and ldquoOccupyArchive Archiving the Occupy Movements from 2011rdquo Centre forHistory and New Media wwwoccupyarchiveorg ltviewed 11 January2013gt

33 Daniel Cohen and Roy Rosenzweig Digital History A Guide to Gathering Preserving and Presenting the Past on the Web(Philadelphia Temple University Press 2005) chapter on ldquoPreservingDigital Historyrdquo The chapter is available fully online atwwwchnmgmuedudigitalhistorypreserving5php ltviewed 19December 2012gt

34 For an overview see ldquoPreserving Our Digital Heritage The NationalDigital Information Infrastructure and Preservation Program 2010 Report a Collaborative Initiative of the Library of CongressrdquoDigitalpreservationgov wwwdigitalpreservationgovmultimediadocumentsNDIIPP2010Report_Postpdf ltviewed 19 December 2012gt

35 Steve Meloan ldquoNo Way to Run a Culturerdquo Wired 13 February 1998available online via the Internet Archivewwwwebarchiveorgweb20000619001705httpwwwwiredcomnewsculture012841030100html ltviewed 19 December 2012gt

36 Some of this has been discussed in greater depth on my website See theseries beginning with Ian Milligan ldquoWARC Files A Challenge forHistorians and Finding Needles in Haystacksrdquo ianmilliganca 12December 2012 wwwianmilliganca20121212warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks ltviewed 20 December2012gt

37 Unstructured data as opposed to structured data is important for histo-rians but brings additional challenges Some projects have producedstructured data novels marked up using XML encoding for example soone can immediately see place names dates character names and soforth This makes it much easier for a computer to read and understand

38 See William Noblett ldquoDigitization A Cautionary Talerdquo New Review ofAcademic Librarianship 17 (2011) 3 Tobias Blanke Michael Bryantand Mark Hedges ldquoOcropodium Open Source OCR for Small-ScaleHistorical Archivesrdquo Journal of Information Science 38 no 1 (2012) 83

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

and Donald S Macqueen ldquoDeveloping Methods for Very-Large-ScaleSearches in Proquest Historical Newspapers Collections and InfotracThe Times Digital Archive The Case of Two Million versus TwoMillionsrdquo Journal of English Linguistics 32 no 2 (June 2004) 127

39 Elizabeth Krug ldquoCanadarsquos Digital Collections Sharing the CanadianIdentity on the Internetrdquo The Archivist [Library and Archives Canadapublication] April 2000 wwwcollectionscanadagccapublicationsarchivistmagazine015002-2170-ehtml ltviewed 11 January 2013gt

40 ldquoWayBackMachine Beta wwwcollectionsicgccardquo Internet ArchiveAvailable onlinewwwwaybackarchiveorgwebhttpcollectionsicgcca ltviewed 11January 2013gt

41 ldquoStreet Youth Go Onlinerdquo News Release from Industry Canada InternetArchive 15 November 1996 Available onlinewwwwebarchiveorgweb20001011131150httpinfoicgccacmbWelcomeicnsfffc979db07de58e6852564e400603639514a8d746a34803385256612004d90b3OpenDocument ltviewed 11 January 2013gt

42 Darlene Fichter ldquoProjects for Libraries and Archivesrdquo Canadarsquos DigitalCollection Webpage 24 November 2000 through Internet Archivewwwwebarchiveorgweb200012091312httpcollectionsicgccaEcdc_projhtm ltviewed 18 June 2012gt

43 ldquoCanadarsquos Digital Collections homepagerdquo 21 November 2003 through Internet Archive wwwwebarchiveorgweb20031121213939httpcollectionsicgccaEindexphpvo=15 ltviewed 18 June 2012gt

44 ldquoImportant news about Canadarsquos Digital Collectionsrdquo Canadarsquos DigitalCollections Webpage 18 November 2006gt through Internet Archivewwwwebarchiveorgweb20061118125255httpcollectionsicgccaEviewhtml ltviewed 18 June 2012gt

45 For more information see the information at ldquoWolfram Mathematica9rdquo Wolfram Research wwwwolframcommathematica ltviewed 11January 2013gt An example of work in Mathematica can be seen inWilliam J Turkelrsquos demonstration of ldquoTerm Weighting with TF-IDFrdquodrawing on Old Bailey proceedings It is available at wwwdemonstrationswolframcomTermWeightingWithTFIDFltviewed 11 January 2013gt

46 It was 8 degrees Celsius at 500 pm on 18 December 1983 ldquoEdithrdquowas more popular than Emma until around 1910 and now ldquoEmmardquo isthe third most popular name in the United States for new births in2010 the Canadian unemployment rate in March 1985 was 106 per-cent 25th highest in the world and the Canadian GDP in December1995 was $6039 billion per year 10th largest in the world

47 It was formerly known as ldquoVoyeur Toolsrdquo before a 2011 name change Itis available online at wwwvoyant-toolsorg

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

48 For more information see ldquoSoftware Environment for the Advancementof Scholarly Research (SEASR)rdquo wwwseasrorg ltviewed 18 June2012gt

49 The MALLET toolkit can also be run independently of SEASR For more on topic modeling specifically see ldquoTopic Modelingrdquo MALLET Machine Learning for Language Toolkitwwwmalletcsumassedutopicsphp ltviewed 18 June 2012gt

50 For a good overview see ldquoAnalytics Dunningrsquos Log-Likelihood StatisticrdquoMONK Metadata Offer New Knowledge wwwgautamlisillinoisedumonkmiddlewarepublicanalyticsdunningshtml ltviewed 18 June2012gt

51 ldquoImportant Noticesrdquo Library and Archives Website last updated 30 July2010 wwwcollectionscanadagccanoticesindex-ehtml ltviewed 18June 2012gt Historians will have to increasingly learn to navigate theTerms of Services (TOS) of born-digital source repositories They can beusually found on index or splash pages under headings as varied asldquoTerms of Servicerdquo ldquoPermissionsrdquo ldquoImportant Noticesrdquo or others

52 The command was wget mdashm mdash-cookies=on mdash-keep-session-cookies mdash-load-cookies=cookietxt mdash-referrer=httpepelac-bacgcca100205301iccdccensusdefaulthtm mdashrandom-wait mdashwait=1 mdashlimit-rate=30K httpepelac-bacgcca100205301iccdcEAlphabetasp This loadeda previous cookie load as there is a ldquosplash redirectrdquo page warning aboutdysfunctional archived content It also imposed a random wait andbandwidth cap to be a good netizen

53 For a discussion of this see Christopher D Manning PrabhakarRaghavan Hinrich Scuumltze An Introduction to Information Retrieval(Cambridge UK Cambridge University Press 2009) 27ndash8 Availableonline at wwwnlpstanfordeduIR-book

54 David M Blei ldquoIntroduction to Probabilistic Topic Modelsrdquowwwcsprincetonedu~bleipapersBlei2011pdf ltviewed 18 June2012gt

55 The research was actually carried out on a rather dated late 2007MacBook Pro

56 See John Bonnett ldquoHigh-Performance Computing An Agenda for theSocial Sciences and the Humanities in Canadardquo Digital Studies 1 no 2(2009) wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt and John Bonnett Geoffrey Rockwell and KyleKuchmey ldquoHigh Performance Computing in the arts and HumanitiesrdquoSHARCNET Shared Hierarchical Academic Research Computing Network2006 wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt

57 ldquoSHARCNET History of SHARCNETrdquo SHARCNET 2009

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

wwwsharcnetcamyabouthistory ltviewed 18 December 2012gt 58 ldquoPriority Areasrdquo SSHRC last modified 4 November 2011

wwwsshrc-crshgccafunding-financementprograms-programmespriority_areas-domaines_prioritairesindex-engaspx ltviewed 18 June2012gt

59 Eloquently discussed in Adam Crymble ldquoIs Digital Humanities a Fieldof Researchrdquo Thoughts on Digital amp Public History 11 August 2011wwwadamcrymbleblogspotcom201108is-digital-humanities-field-of-researchhtml ltviewed 18 June 2012gt

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 21: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

textual analysis programs found elsewhere on the web. The most comprehensive gathering of these tools can be found with the Software Environment for the Advancement of Scholarly Research (SEASR) and its Meandre workbench. Meandre, which can be installed as a local server on your system or, in a best-case scenario, hosted by an institutional IT department, allows you to tap into a variety of workflows to analyze data.48 One workflow mentioned in this article is the topic modeling toolkit found within the Machine Learning for Language Toolkit (MALLET) package, which clusters unstructured data into "topics" that appear together, giving a quick sense of a document's structure.49 Other workflows found within SEASR include Dunning's log-likelihood statistic, which compares a selected document to a broader reference corpus (for example, you could compare the lyrics that appear in 1968 against all lyrics in the postwar era and learn what makes 1968 most unique);50 sentiment analyses, which attempt to plot changes in emotional words and phrases over time; and a variety of entity extraction algorithms (which find "people," "places," and so forth in a text). Corpuses that were previously too big to extract meaningful information from can now be explored relatively quickly.
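Since Dunning's statistic recurs below, the underlying formula may help. In its standard information-retrieval form (a textbook formulation, not one spelled out in this article), the log-likelihood score is

G² = 2 Σᵢ Oᵢ ln(Oᵢ / Eᵢ)

where the Oᵢ are the observed counts of a term in the target document and in the reference corpus, and the Eᵢ are the counts expected if the term were used at the same rate in both. Large scores flag the words that most distinguish a document from its reference corpus.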

As befits a technical side to history, all of the freely available tools discussed briefly here (notably wget, SEASR, and MALLET) are discussed in more technical terms in the peer-reviewed Programming Historian 2 or on my own blog at ianmilligan.ca. I have elected to leave those instructions online rather than provide commands in this article: they require some technical knowledge, and working directly from the web is quicker than working from print. Please consider the following section an applied version of some of the lessons you can learn through these processes.

From Web Collection to Dynamic Finding Aids

Let us briefly return to our archival metaphor. Today's historian dealing with conventional sources often has to physically travel to an archive, requisition boxes, perhaps take photographs of materials if time is tight, and otherwise take notes. With born-digital sources, however, the formula is reversed: our sources can come to us. In a few keystrokes, we can bring an unparalleled quantity of information to our home or work computers. With Canada's Digital Collections, a single command line entry brought the entire archive onto my system; it took days to process the information and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay, or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations. The information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, with 360 MB of plaintext with HTML encoding, reduced to a 'mere' 50 MB if all encoding is removed. A 50 MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with material. Before you do this, though, you need to be familiar with the website's "terms of service," or the legal documentation that outlines how users are allowed to use a website, resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information "may be reproduced in part or in whole and by any means, without charge or further permission, unless otherwise specified."51

The CDC web archive index is organized like so: the index is located at epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in epe.lac-bac.gc.ca/100/205/301/ic/cdc/ABbenevolat, the next in …cdc/abnature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system, in this case one that noted that you had seen the disclaimer about the archival nature of the website), and then used a one-line command to download the entire database.52 Due to the imposition of a random wait and bandwidth limit, which meant that I would download one file every half to one and a half seconds at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these connections must often be inferred; online, HTML born-digital sources are different in that the link offers a very concrete file.

Visualizing the CDC Collection as a Series of Hyperlinks

One quick visualization, the circular cluster in the figure above, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. The CDC collection can be visualized as a series of hyperlinks (each line): the cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The model is limited to following two links beyond the first page. Through this approach we see the interconnection of knowledge and relationships between different sources. At a glimpse, we can see suggestions to follow other sites and determine if any nodes are shared amongst the various websites revealed.
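A minimal sketch of the link-scraping step, using Mathematica's built-in "Hyperlinks" import element; the CDC index URL comes from the wget discussion above, while the 25-page sample and the plotting options are illustrative choices rather than the article's actual code:

(* Pull outbound links from the archived CDC index, then one hop further *)
index = "http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp";
absoluteLinks[url_] :=
  Select[Import[url, "Hyperlinks"], StringMatchQ[#, "http*"] &];
firstHop = Take[absoluteLinks[index], 25];
edges = Flatten[Table[page -> target,
    {page, firstHop}, {target, absoluteLinks[page]}]];
(* Internal pages cluster together; external links radiate outward.
   VertexLabeling -> Tooltip supplies the mouse-over URL display. *)
GraphPlot[edges, Method -> "SpringElectricalEmbedding",
  DirectedEdges -> True, VertexLabeling -> Tooltip]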

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada's Digital Collections corpus? As a historian who is primarily interested in the written word, my first step was to isolate the plaintext throughout the material so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels, first the aggregate collection and second dividing the collection into each 'file,' or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words within the nearly eight million words of the collection appear? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word "Canada" appears 26,429 times, "collections" 11,436, and "[L]ouisbourg" an astounding 9,365 times. Reading such a list can quickly wear out a reader and is of minimal visual attractiveness. Two quick decisions were made to enhance data usability. First, the information was "case-folded," all sent to lower-case; this enables us to count the total appearances of a word such as "digital" regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency, as opposed to phrase frequency, "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" is comprised entirely of stop words) but will quickly dominate any word-level visualization or list.53 Other words may need to be added or removed from stop lists depending on the corpus; for example, if the phrase "page number" appeared on every page, that would be useless information for the researcher.
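As a concrete illustration, here is a minimal Mathematica sketch of the case-folding and stop-word routine just described; the file name and the abbreviated stop list are placeholders rather than the article's own code:

(* Case-fold, tokenize, drop stop words, and tally word frequencies *)
text = ToLowerCase[Import["cdc-plaintext.txt", "Text"]];
words = StringCases[text, WordCharacter ..];
stopwords = {"the", "and", "to", "of", "a", "in", "de", "la", "le"};
kept = Select[words, ! MemberQ[stopwords, #] &];
(* Most frequent words first, e.g. {"canada", 26429} at the top *)
frequencies = Reverse[SortBy[Tally[kept], Last]];
Take[frequencies, 25]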

The first step, then, is to use Mathematica to format word frequency information into Wordle.net-readable format (if we put all of the text into that website it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds. What can we learn from this illustration?

A Word Cloud of All Words in the CDC Corpus

First, without knowing anything about the sites, we can immediately learn that it is dealing with Canada/Canadians, that it is a collection with documents in French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, a word cloud leaves a severe amount of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but need to be used with extreme caution. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present." Despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters such as 'bi,' two syllables, or, pertinent in our case, two words. An example: "canada is." A trigram is three words, a quadgram is four words, a fivegram is five words, and so on.
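A minimal sketch of the phrase-frequency side in Mathematica, reusing the case-folded (but not stop-word-filtered) words list from the earlier sketch, since stop words matter for phrases; the choice of trigrams is illustrative:

(* A sliding window of length 3, offset 1, yields every word-level trigram *)
trigrams = Partition[words, 3, 1];
(* Most frequent trigrams first *)
Reverse[SortBy[Tally[trigrams], Last]]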


Word Cloud of the File "Past and Present"

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

"alberta", 758; "listen", 490; "settlers", 467; "new", 454; "canada", 388; "land", 362; "read", 328; "people", 295; "settlement", 276; "chinese", 268; "canadian", 264; "place", 255; "heritage", 254; "edmonton", 250; "years", 238; "west", 236; "raymond", 218; "ukrainian", 216; "came", 213; "calgary", 209; "names", 206; "ranch", 204; "updated", 193; "history", 190; "communities", 183; …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example, in this automatically generated output of top phrases:

"first people settlers", 100; "the united states", 72; "one of the", 59; "of the west", 58; "settlers last updated", 56; "people settlers last", 56; "albertans first people", 56; "adventurous albertans first", 56; "the bar u", 55; "opening of the", 54; "communities adventurous albertans", 54; "new communities adventurous", 54; "bar u ranch", 53; "listen to the", 52; …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n = 457) or, second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n = 45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stop-word" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

"experience in the", 919; "transferred to the", 918; "funded by the", 918; "of the provincial", 915; "this web site", 915; "409 registry 2", 914; "are no longer", 914; "gouvernement du canada", 910; "use this link", 887; "page or you", 887; "following page or", 887; "automatically transferred to", 887; "be automatically transferred", 887; "will be automatically", 887; "industry you will", 887; "multimedia industry you", 887; "practical experience in", 887; "gained practical experience", 887; "they gained practical", 887; "while they gained", 887; "canadians while they", 887; "young canadians while", 887; "showcase of work", 887; "excellent showcase of", 887; …

Here we see an array of common language: funding information, link information, project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we can move into the unique or very rare phrases that appear only once, twice, or three times.

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable and come in two formats: a word cloud visualization as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works: it is a proprietary black box. While the original algorithm is available, the actual methods that go into determining the ranking of search terms are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for the occurrence of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
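A minimal sketch of such a search in Mathematica, assuming each site's plaintext has been saved to its own file in a directory (the directory name is hypothetical); Tooltip supplies the mouse-over behaviour described above:

(* Count a term's occurrences in each site's plaintext file *)
files = FileNames["*.txt", "cdc-plaintext"];
termCount[file_, term_] :=
  Count[StringCases[ToLowerCase[Import[file, "Text"]], WordCharacter ..], term];
(* One bar per site; hover over a bar to see which file it represents *)
BarChart[Table[Tooltip[termCount[f, "freedom"], FileNameTake[f]], {f, files}]]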

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.


A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly identify critical sources.
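A keyword-in-context routine can be sketched in a few lines of Mathematica; the 40-character window and the file name are illustrative, and the article's own routine is more elaborate:

(* Return every occurrence of a term with 40 characters of context per side *)
kwic[text_String, term_String, window_: 40] :=
  StringTake[text, {Max[1, First[#] - window],
      Min[StringLength[text], Last[#] + window]}] & /@
   StringPosition[ToLowerCase[text], ToLowerCase[term]];

kwic[Import["blackgold.txt", "Text"], "ontario"] // TableForm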

Another model through which we can visualize documents is through a viewer primarily focusing on term frequency and the inverse document frequency terms discussed below. To prepare this material, we are combining our earlier data about word frequency (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e., the word "experience" appears 1,766 times). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf times the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.
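The weighting just described reduces to a one-line function. In this sketch, the document-frequency figure of 250 is a made-up placeholder, while the 919 term count and the 384-site corpus size come from the article's own examples:

(* tf-idf: term frequency, discounted when a term appears in many documents *)
tfidf[tf_, docsWithTerm_, totalDocs_] := tf*Log[totalDocs/docsWithTerm];

(* "experience" appearing 919 times in one site, in a hypothetical 250 of 384 sites *)
tfidf[919, 250, 384] // N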

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5 MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance rather than processed on-site and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need rears itself, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin, or jumpstart, a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information contained across over 73,000 files (as in the CDC collection) can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. This should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information; this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web, whether through blogging, micro-blogging, or publishing web pages; it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history and the readiness to work with born-digital sources cannot stem from pedagogy alone, however; institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples from the CDC case study, for instance, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary and, second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.genealogy.umontreal.ca/fr/leprdh.htm.


14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients – 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to teach the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook, Penguin, 2010. For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2, The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in James Gleick, The Information: A History, a Theory, a Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online, www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can, of course, vary depending on compression methods chosen. Their math was based on an average of 8 MB per book if it was scanned at 600 dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82, published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC: May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Center for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Center for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 ldquoWayBackMachine Beta wwwcollectionsicgccardquo Internet ArchiveAvailable onlinewwwwaybackarchiveorgwebhttpcollectionsicgcca ltviewed 11January 2013gt

41 ldquoStreet Youth Go Onlinerdquo News Release from Industry Canada InternetArchive 15 November 1996 Available onlinewwwwebarchiveorgweb20001011131150httpinfoicgccacmbWelcomeicnsfffc979db07de58e6852564e400603639514a8d746a34803385256612004d90b3OpenDocument ltviewed 11 January 2013gt

42 Darlene Fichter ldquoProjects for Libraries and Archivesrdquo Canadarsquos DigitalCollection Webpage 24 November 2000 through Internet Archivewwwwebarchiveorgweb200012091312httpcollectionsicgccaEcdc_projhtm ltviewed 18 June 2012gt

43 ldquoCanadarsquos Digital Collections homepagerdquo 21 November 2003 through Internet Archive wwwwebarchiveorgweb20031121213939httpcollectionsicgccaEindexphpvo=15 ltviewed 18 June 2012gt

44 ldquoImportant news about Canadarsquos Digital Collectionsrdquo Canadarsquos DigitalCollections Webpage 18 November 2006gt through Internet Archivewwwwebarchiveorgweb20061118125255httpcollectionsicgccaEviewhtml ltviewed 18 June 2012gt

45 For more information see the information at ldquoWolfram Mathematica9rdquo Wolfram Research wwwwolframcommathematica ltviewed 11January 2013gt An example of work in Mathematica can be seen inWilliam J Turkelrsquos demonstration of ldquoTerm Weighting with TF-IDFrdquodrawing on Old Bailey proceedings It is available at wwwdemonstrationswolframcomTermWeightingWithTFIDFltviewed 11 January 2013gt

46 It was 8 degrees Celsius at 500 pm on 18 December 1983 ldquoEdithrdquowas more popular than Emma until around 1910 and now ldquoEmmardquo isthe third most popular name in the United States for new births in2010 the Canadian unemployment rate in March 1985 was 106 per-cent 25th highest in the world and the Canadian GDP in December1995 was $6039 billion per year 10th largest in the world

47 It was formerly known as ldquoVoyeur Toolsrdquo before a 2011 name change Itis available online at wwwvoyant-toolsorg

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

48 For more information see ldquoSoftware Environment for the Advancementof Scholarly Research (SEASR)rdquo wwwseasrorg ltviewed 18 June2012gt

49 The MALLET toolkit can also be run independently of SEASR For more on topic modeling specifically see ldquoTopic Modelingrdquo MALLET Machine Learning for Language Toolkitwwwmalletcsumassedutopicsphp ltviewed 18 June 2012gt

50 For a good overview see ldquoAnalytics Dunningrsquos Log-Likelihood StatisticrdquoMONK Metadata Offer New Knowledge wwwgautamlisillinoisedumonkmiddlewarepublicanalyticsdunningshtml ltviewed 18 June2012gt

51 ldquoImportant Noticesrdquo Library and Archives Website last updated 30 July2010 wwwcollectionscanadagccanoticesindex-ehtml ltviewed 18June 2012gt Historians will have to increasingly learn to navigate theTerms of Services (TOS) of born-digital source repositories They can beusually found on index or splash pages under headings as varied asldquoTerms of Servicerdquo ldquoPermissionsrdquo ldquoImportant Noticesrdquo or others

52 The command was wget mdashm mdash-cookies=on mdash-keep-session-cookies mdash-load-cookies=cookietxt mdash-referrer=httpepelac-bacgcca100205301iccdccensusdefaulthtm mdashrandom-wait mdashwait=1 mdashlimit-rate=30K httpepelac-bacgcca100205301iccdcEAlphabetasp This loadeda previous cookie load as there is a ldquosplash redirectrdquo page warning aboutdysfunctional archived content It also imposed a random wait andbandwidth cap to be a good netizen

53 For a discussion of this see Christopher D Manning PrabhakarRaghavan Hinrich Scuumltze An Introduction to Information Retrieval(Cambridge UK Cambridge University Press 2009) 27ndash8 Availableonline at wwwnlpstanfordeduIR-book

54 David M Blei ldquoIntroduction to Probabilistic Topic Modelsrdquowwwcsprincetonedu~bleipapersBlei2011pdf ltviewed 18 June2012gt

55 The research was actually carried out on a rather dated late 2007MacBook Pro

56 See John Bonnett ldquoHigh-Performance Computing An Agenda for theSocial Sciences and the Humanities in Canadardquo Digital Studies 1 no 2(2009) wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt and John Bonnett Geoffrey Rockwell and KyleKuchmey ldquoHigh Performance Computing in the arts and HumanitiesrdquoSHARCNET Shared Hierarchical Academic Research Computing Network2006 wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt

57 ldquoSHARCNET History of SHARCNETrdquo SHARCNET 2009

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

wwwsharcnetcamyabouthistory ltviewed 18 December 2012gt 58 ldquoPriority Areasrdquo SSHRC last modified 4 November 2011

wwwsshrc-crshgccafunding-financementprograms-programmespriority_areas-domaines_prioritairesindex-engaspx ltviewed 18 June2012gt

59 Eloquently discussed in Adam Crymble ldquoIs Digital Humanities a Fieldof Researchrdquo Thoughts on Digital amp Public History 11 August 2011wwwadamcrymbleblogspotcom201108is-digital-humanities-field-of-researchhtml ltviewed 18 June 2012gt

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 22: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

few keystrokes we can bring an unparalleled quantity of information to our home or work computers. With Canada’s Digital Collections, a single command-line entry brought the entire archive onto my system; it took days to process the information and I had to leave my computer on, but it did not require any active user engagement.

How big is the CDC collection in terms of sheer information? The plaintext, without any HTML markup, is approximately 7,805,135 words. If each page contained 250 words, it would be the equivalent of a 31,220-page essay, or 1,000 dissertations. However, it is even more complicated than just reading 1,000 dissertations. The information is spread across 73,585 separate HTML files, themselves part of a much broader corpus of 278,785 files. Even the sheer size of the database requires big data methodologies: an aggregate size of 7.26 GB, with 360 MB of plaintext with HTML encoding, reduced to a ‘mere’ 50 MB if all encoding is removed. A 50 MB plain text document can quickly slow down even a relatively powerful personal computer. There is no simple way to make this a conventionally accessible amount of information.

How can we access this information? Several different methods are possible. First, we can work with the information on the internet itself, or we can download it onto our own system. I prefer the latter method, as it decreases computational time, adds flexibility, and enables free rein to experiment with material. Before you do this, though, you need to be familiar with the website’s “terms of service”: the legal documentation that outlines how users are allowed to use a website, resources, or software. In this case, LAC provides generous non-commercial reproduction terms: information “may be reproduced, in part or in whole and by any means, without charge or further permission, unless otherwise specified.”51

The CDC web archive index is organized like so: the index is located at epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, with every subsequent website branching off the cdc folder. For example, one website would be hosted in …/ic/cdc/ABbenevolat, the next in …cdc/abnature, and so on. Using wget, a free software program, you can download all websites in a certain structure: the program recursively downloads information in a methodical fashion. I first used wget to save the cookies (little files kept on your system, in this case one that noted that you had seen the disclaimer about the archival nature of the website), and then used a one-line command in my command line to download the entire database.52 Due to the imposition of a random wait and bandwidth limit, which meant that I would download one file every half to one and a half seconds, at a maximum of 30 kilobytes a second, the process took six days. In the Programming Historian 2, I have an article expanding upon wget for humanities researchers.
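For reference, the full command from note 52, reformatted here for legibility (the URLs’ punctuation is reconstructed from the archived collection’s path structure):

    wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt \
         --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm \
         --random-wait --wait=1 --limit-rate=30K \
         http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp

The -m (mirror) flag handles the recursion, while --random-wait, --wait=1, and --limit-rate=30K throttle the crawl so it does not overwhelm the server.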

Hypertext itself deserves a brief discussion. The basic building block of the web is the link: one page is connected to another, and people navigate through these webs. While books and articles, for example, are connected to each other by virtue of ideas and concrete citations, these connections must often be inferred; online HTML born-digital sources are different in that the link offers a very concrete file.

Visualizing the CDC Collection as a Series of Hyperlinks

One quick visualization, the circular cluster seen above, lets us see the unique nature of this information. We can go to each web page, scrape the links, and see where they take viewers. As we see in the figure, the CDC collection can be visualized as a series of hyperlinks (each line). The cluster in the middle is the CDC collection itself, whereas the strands on the outside represent external websites. Using Mathematica, I have generated a dynamic model: you can move your mouse over each line, learn what link it is, and thus see where the collection fits into the broader Internet. The site is limited to following two links beyond the first page. Through this approach we see the interconnection of knowledge and relationships between different sources. At a glimpse, we can see suggestions to follow other sites, and determine if any nodes are shared amongst the various websites revealed.
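A minimal sketch of the scraping step might look like this in Mathematica; it illustrates the general approach rather than reproducing the exact code behind the figure, and it assumes the archived index page is still reachable at the address given above. Import’s “Hyperlinks” element returns every link on a page, and the crawl stops, as in the figure, two links beyond the first page:

    root = "http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp";
    firstHop = Import[root, "Hyperlinks"];            (* links on the index page *)
    secondHop = Flatten[Map[
        Function[page, Map[(page -> #) &,
          Quiet[Check[Import[page, "Hyperlinks"], {}]]]],   (* skip pages that fail to load *)
        firstHop]];
    Graph[Join[Map[(root -> #) &, firstHop], secondHop],
      VertexLabels -> Placed["Name", Tooltip]]        (* mouse-over reveals each URL *)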

Visualization: Moving Beyond Word Clouds

Where to begin with a large collection such as the Canada’s Digital Collections corpus? As a historian who is primarily interested in the written word, my first step was to isolate the plaintext throughout the material, so that I could begin rudimentary data mining and textual analysis on the corpus. For this, I approached the CDC collection on two distinct levels: first, the aggregate collection, and second, dividing the collection into each ‘file,’ or individual website, deriving word and phrase data for each. Imagine if you could look into an archival collection and get a quick sense of what topics and ideas were covered. Taking that thought further, imagine the utility if you could do this with individual archival boxes and even files. With a plethora of born-digital sources, this is now a necessary step.

On an aggregate level, word and phrase data is a way to explore the information within. How often do various words within the nearly eight million words of the collection appear? Using Mathematica, we can quickly generate a list of words sorted by their relative frequency: we learn that the word “Canada” appears 26,429 times, “collections” 11,436, and “[L]ouisbourg” an astounding 9,365 times. Reading such a list can quickly wear out a reader and is of minimal visual attractiveness. Two quick decisions were made to enhance data usability: first, the information was “case-folded,” all sent to lower-case; this enables us to count the total appearances of a word such as “digital” regardless of whether or not it appears at the beginning of a sentence.

Second, when dealing with word frequency, as opposed to phrase frequency, “stop words” should be removed, as they were here. These are “extremely common words which would appear to be of little value in helping select documents matching a user need”: words such as “the,” “and,” “to,” and so forth. They matter for phrases (the example “to be or not to be” is comprised entirely of stop words), but will quickly dominate any word-level visualization or list.53 Other words may need to be added or removed from stop lists depending on the corpus: for example, if the phrase “page number” appeared on every page, that would be useless information for the researcher.
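Both decisions are a few lines in Mathematica. A rough sketch, assuming the stripped plaintext has been gathered into a single, hypothetically named file; the stop list is deliberately abbreviated, and the “word:weight” export format at the end is my recollection of Wordle’s advanced input rather than anything documented here:

    text = ToLowerCase[Import["cdc-plaintext.txt", "Text"]];          (* case-folding *)
    words = StringSplit[text, Except[WordCharacter] ..];              (* split on non-word characters *)
    stopWords = {"the", "and", "to", "of", "a", "in", "is", "for"};   (* abbreviated; real lists run to hundreds *)
    freqs = Reverse[SortBy[Tally[Select[words, ! MemberQ[stopWords, #] &]], Last]];
    Take[freqs, 25]    (* e.g. {"canada", 26429}, {"collections", 11436}, ... *)

    (* assumed "word:weight" lines for Wordle's advanced input *)
    Export["wordle-input.txt",
      StringJoin[Riffle[Map[#[[1]] <> ":" <> ToString[#[[2]]] &, Take[freqs, 150]], "\n"]],
      "Text"]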

The first step, then, is to use Mathematica to format word frequency information into Wordle.net-readable format (if we put all of the text into that website, it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds.

A Word Cloud of All Words in the CDC Corpus

What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that it is dealing with Canada/Canadians, that it is a collection with documents in French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that “ontario” is less frequent than “newfoundland,” “nova scotia,” and “alberta,” although “québec” and “quebec” are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, they leave a severe amount of ambiguity: are “ontario” and “québec” equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word “school” refer to built structures or educational experiences? The same question applies to “church.” Do mentions of “newfoundland” refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but need to be used with extreme caution. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website “past and present.” Despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters (such as ‘bi’), two syllables, or, pertinent in our case, two words. An example: “canada is.” A trigram is three words, a quadgram is four words, a fivegram is five words, and so on.
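In Mathematica, overlapping n-grams fall out of a single Partition call. A sketch of the idea, with a hypothetical file name for the site in question:

    words = StringSplit[ToLowerCase[Import["file-267.txt", "Text"]],
       Except[WordCharacter] ..];
    ngrams[n_] := Reverse[SortBy[Tally[Partition[words, n, 1]], Last]];  (* offset 1 = overlapping windows *)
    Take[ngrams[3], 5]   (* top trigrams, e.g. {{"first", "people", "settlers"}, 100} *)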


Word Cloud of the File “Past and Present”

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach’s utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named “past and present” website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

“alberta” 758, “listen” 490, “settlers” 467, “new” 454, “canada” 388, “land” 362, “read” 328, “people” 295, “settlement” 276, “chinese” 268, “canadian” 264, “place” 255, “heritage” 254, “edmonton” 250, “years” 238, “west” 236, “raymond” 218, “ukrainian” 216, “came” 213, “calgary” 209, “names” 206, “ranch” 204, “updated” 193, “history” 190, “communities” 183 …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example in this automatically generated output of top phrases:

“first people settlers” 100, “the united states” 72, “one of the” 59, “of the west” 58, “settlers last updated” 56, “people settlers last” 56, “albertans first people” 56, “adventurous albertans first” 56, “the bar u” 55, “opening of the” 54, “communities adventurous albertans” 54, “new communities adventurous” 54, “bar u ranch” 53, “listen to the” 52 …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to quickly hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if “aboriginal” appears 457 times, then n=457); or second, relative appearances of a given term (if “aboriginal” appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the ‘chorus’ effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common “stopword” equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below.
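Both measures are easy to express. A toy sketch, with hypothetical data and helper names standing in for the real per-site frequency tables:

    freqTables = {{"site-a", {{"aboriginal", 3}, {"canada", 10}}},
                  {"site-b", {{"canada", 7}}}};                           (* toy per-site tallies *)
    countIn[tbl_, term_] := Total[Cases[tbl, {term, n_} :> n]];
    aggregate[term_] := Total[Map[countIn[#[[2]], term] &, freqTables]];  (* n = total hits *)
    relative[term_] := Count[freqTables, {_, t_} /; countIn[t, term] > 0] /
       Length[freqTables];                                                (* n = share of sites *)
    {aggregate["canada"], relative["aboriginal"]}   (* -> {17, 1/2} *)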

“experience in the” 919, “transferred to the” 918, “funded by the” 918, “of the provincial” 915, “this web site” 915, “409 registry 2” 914, “are no longer” 914, “gouvernement du canada” 910, “use this link” 887, “page or you” 887, “following page or” 887, “automatically transferred to” 887, “be automatically transferred” 887, “will be automatically” 887, “industry you will” 887, “multimedia industry you” 887, “practical experience in” 887, “gained practical experience” 887, “they gained practical” 887, “while they gained” 887, “canadians while they” 887, “young canadians while” 887, “showcase of work” 887, “excellent showcase of” 887 …

Here we see an array of common language: funding information, link information, project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: “the history of,” for example, “heritage community foundation,” libraries, and so forth. By the end of the list, we can move into the unique or very rare phrases that appear only once, twice, or three times.

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton’s David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable, and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 (“past and present”), we can begin to get a sense of the topics this large website discusses.
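If one takes the command-line route, a typical MALLET run over a directory of plaintext files is a two-step pattern; the directory and file names here, and the choice of twenty topics, are illustrative assumptions rather than the settings behind the figures:

    bin/mallet import-dir --input cdc-plaintext/ --output cdc.mallet \
        --keep-sequence --remove-stopwords
    bin/mallet train-topics --input cdc.mallet --num-topics 20 \
        --output-topic-keys cdc-topic-keys.txt --xml-topic-report cdc-topics.xml

The --output-topic-keys file yields the headline words for each topic, while --xml-topic-report produces the kind of topic-by-topic XML breakdown described below.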


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know how Google works, precisely. It is a proprietary black box: while the original algorithm is available, the actual method that goes into determining the ranking of search terms is a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for the occurrence of the words “freedom,” “black,” and “liberty” (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above, we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
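The search itself is little more than a scan over the per-site frequency tables. A sketch, assuming (hypothetically) that each site’s word counts sit in a two-column “word count” file under a cdc-word-freqs directory:

    freqTables = Map[{FileBaseName[#], Import[#, "Table"]} &,
       FileNames["*.txt", "cdc-word-freqs"]];                  (* {site, {{word, count}, ...}} *)
    countIn[tbl_, term_] := Total[Cases[tbl, {term, n_} :> n]];
    hits[term_] := Select[Map[{#[[1]], countIn[#[[2]], term]} &, freqTables],
       Last[#] > 0 &];
    BarChart[Map[Last, hits["freedom"]],
      ChartLabels -> Map[First, hits["freedom"]]]              (* where, and how often, the term appears *)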

We can then move down into the data. In the example above, “047,” labeled “blackgold.txt,” contains 31 occurrences of the word “ontario.” Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.
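A keyword-in-context routine is easy to sketch: locate each occurrence of a term, then print it with a window of surrounding words. The following illustrates the idea rather than reproducing the article’s own code, with the file name taken from the example above:

    kwic[text_, term_, w_: 5] := Module[{words = StringSplit[text], pos},
      pos = Flatten[Position[ToLowerCase[words], term]];
      Map[StringJoin[Riffle[Take[words,
           {Max[1, # - w], Min[Length[words], # + w]}], " "]] &, pos]];
    kwic[Import["blackgold.txt", "Text"], "ontario"] // Column   (* five words of context either side *)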


A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly identify critical sources.

Another model through which we can visualize documents is through a viewer primarily focusing on term frequency and inverse document frequency, discussed below. To prepare this material, we are combining our earlier data about word frequency (that the word “experience” appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e., the word “experience” appears 1,766 times). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf multiplied by the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select “inverse document frequency.”
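Expressed as a one-line function, with figures from the surrounding discussion standing in as illustrative inputs (919 occurrences of a term, appearing in 45 of 384 sites):

    tfIdf[tf_, df_, nDocs_] := tf*Log[nDocs/df];   (* tf times log(N / document frequency) *)
    tfIdf[919, 45, 384] // N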


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5 MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance, rather than processed on-site, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need arises, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin or jumpstart a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and perhaps, subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. These should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web, whether through blogging, micro-blogging, publishing web pages, and so forth: it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history, and the readiness to work with born-digital sources, cannot stem from pedagogy alone, however. Institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a “digital humanities” option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary, and second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu’un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l’histoire des travailleurs.

Endnotes

1 The question of when the past becomes a “history” to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates/works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See “IdleNoMore Timeline,” ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on “Collecting History Online.” Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, “History and the Second Decade of the Web,” Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, “Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day,” ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user’s following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, “This is What a Tweet Looks Like,” ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, “How Computation Changes Research,” in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, “History and the Social Sciences: The Longue Durée,” in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at “Text Analysis,” Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science 331 (14 January 2011): 176–82. The tool is publicly accessible at “Google Books Ngram Viewer,” Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, “Loneliness and Freedom,” Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, “Loneliness and Freedom,” Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.


14 “With Criminal Intent,” CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see “Award Recipients — 2011,” Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to learn the basics of computational history. See The Programming Historian 2, 2012–Present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, “Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective,” Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, “History and Computing,” Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, “Text and Document Visualization,” The Humanities and Technology Camp Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook: Penguin, 2010. For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, “Cisco CRS-3: Download Library of Congress in Just Over One Second,” Mashable (9 March 2010), accessed online: www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a “byte” (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can of course vary depending on compression methods chosen. Their math was based on an average of 8 MB per book if it was scanned at 600 dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, “Transferring ‘Libraries of Congress’ of Data,” Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 “About the Library of Congress Web Archives,” Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 “10,000,000,000,000,000 Bytes Archived,” Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82. Published Online Ahead of Print: 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, “Google’s Piracy Liability,” Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 “Government of Canada Web Archive,” Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, “Visiting Library and Archives in Ottawa? Not Without an Appointment,” Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, “LAC Begins Implementation of New Approach to Service Delivery,” Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the “CHA’s Comment on New Directions for Library and Archives Canada,” CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and “Save Library & Archives Canada,” Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, “The Smokescreen of ‘Modernization’ at Library and Archives Canada,” ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 “Save LAC: May 2012 Campaign Update,” SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, “The Future of Preserving the Past,” CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see “September 11 Digital Archive: Saving the Histories of September 11, 2001,” Centre for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; “Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita,” Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and “Occupy Archive: Archiving the Occupy Movements from 2011,” Centre for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: Temple University Press, 2005), chapter on “Preserving Digital History.” The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see “Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress,” Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, “No Way to Run a Culture,” Wired, 13 February 1998, available online via the Internet Archive: www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, “WARC Files: A Challenge for Historians, and Finding Needles in Haystacks,” ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, “Digitization: A Cautionary Tale,” New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, “Ocropodium: Open Source OCR for Small-Scale Historical Archives,” Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, “Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions,” Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, “Canada’s Digital Collections: Sharing the Canadian Identity on the Internet,” The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 “WayBackMachine Beta: www.collections.ic.gc.ca,” Internet Archive. Available online: www.wayback.archive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 “Street Youth Go Online,” News Release from Industry Canada, Internet Archive, 15 November 1996. Available online: www.web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, “Projects for Libraries and Archives,” Canada’s Digital Collection Webpage, 24 November 2000, through Internet Archive: www.web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 “Canada’s Digital Collections homepage,” 21 November 2003, through Internet Archive: www.web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 “Important news about Canada’s Digital Collections,” Canada’s Digital Collections Webpage, 18 November 2006, through Internet Archive: www.web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see the information at “Wolfram Mathematica 9,” Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel’s demonstration of “Term Weighting with TF-IDF,” drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 p.m. on 18 December 1983; “Edith” was more popular than Emma until around 1910, and now “Emma” is the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as “Voyeur Tools” before a 2011 name change. It is available online at www.voyant-tools.org.


48 For more information, see “Software Environment for the Advancement of Scholarly Research (SEASR),” www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see “Topic Modeling,” MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see “Analytics: Dunning’s Log-Likelihood Statistic,” MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 “Important Notices,” Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. They can usually be found on index or splash pages under headings as varied as “Terms of Service,” “Permissions,” “Important Notices,” or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previous cookie load, as there is a “splash redirect” page warning about dysfunctional archived content. It also imposed a random wait and bandwidth cap to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54 David M. Blei, “Introduction to Probabilistic Topic Models,” www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated late-2007 MacBook Pro.

56 See John Bonnett, “High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada,” Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, “High Performance Computing in the Arts and Humanities,” SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 “SHARCNET: History of SHARCNET,” SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 “Priority Areas,” SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, “Is Digital Humanities a Field of Research?” Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 23: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

cookies (little files kept on your system in this case one that notedthat you had seen the disclaimer about the archival nature of thewebsite) and then with an one-line command in my command lineto download the entire database52 Due to the imposition of a ran-dom wait and bandwidth limit which meant that I would downloadone file every half to one and a half seconds at a maximum of 30kilobytes a second the process took six days In the ProgrammingHistorian 2 I have an article expanding upon wget for humanitiesresearchers

Hypertext itself deserves a brief discussion The basic buildingblock of the web is the link one page is connected to another andpeople navigate through these webs While books and articles forexample are connected to each other by virtue of ideas and concretecitations these must often be inferred online HTML born-digitalsources are different in that the link offers a very concrete file One

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Visualizing the CDC Collection as a Series of Hyperlinks

quick visualization the circular cluster seen above lets us see theunique nature of this information We can go to each web pagescrape the links and see where they take viewers As we see in thebelow figure the CDC collection can be visualized as a series ofhyperlinks (each line) The cluster in the middle is the CDC collec-tion itself whereas the strands on the outside represent externalwebsites Using Mathematica I have generated a dynamic modelyou can move your mouse over each line learn what link it is andthus see where the collection fits into the broader Internet The siteis limited to following two links beyond the first page Through thisapproach we see the interconnection of knowledge and relationshipsbetween different sources At a glimpse we can see suggestions to fol-low other sites and determine if any nodes are shared amongst thevarious websites revealed

Visualization Moving Beyond Word Clouds

Where to begin with a large collection such as the Canadarsquos DigitalCollection corpus As a historian who is primarily interested in thewritten word my first step was to isolate the plaintext throughout thematerial so that I could begin rudimentary data mining and textualanalysis on the corpus For this I approached the CDC collection ontwo distinct levels mdash first the aggregate collection and second divid-ing the collection into each lsquofilersquo or individual website mdash derivingword and phrase data for each Imagine if you could look throughinto an archival collection and get a quick sense of what topics andideas were covered Taking that thought further imagine the utility ifyou could do this with individual archival boxes and even files Witha plethora of born-digital sources this is now a necessary step

On an aggregate level word and phrase data is a way to explorethe information within How often do various words within thenearly eight million words of the collection appear UsingMathematica we can quickly generate a list of words sorted by theirrelative frequency we learn that the word ldquoCanadardquo appears 26429times ldquocollectionsrdquo 11436 and ldquo[L]ouisbourgrdquo an astounding9365 times Reading such a list can quickly wear out a reader and isof minimal visual attractiveness Two quick decisions were made to

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

enhance data usability first the information was ldquocase-foldedrdquo allsent to lower-case this enables us to count the total appearances of aword such as ldquodigitalrdquo regardless of whether or not it appears at thebeginning of a sentence

Second, when dealing with word frequency, as opposed to phrase frequency, "stop words" should be removed, as they were here. These are "extremely common words which would appear to be of little value in helping select documents matching a user need": words such as "the," "and," "to," and so forth. They matter for phrases (the example "to be or not to be" consists entirely of stop words) but will quickly dominate any word-level visualization or list.53 Other words may need to be added to or removed from stop lists depending on the corpus: for example, if the phrase "page number" appeared on every page, that would be useless information for the researcher.
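In outline, these steps take only a few lines of Mathematica. The sketch below assumes the collection's plaintext has been gathered into a single string, here a hypothetical variable named corpus, and uses a deliberately abbreviated stop list:

    (* split into words and case-fold *)
    stopwords = {"the", "and", "to", "of", "a", "in", "is"};  (* abbreviated list *)
    words = ToLowerCase /@ StringSplit[corpus, Except[WordCharacter] ..];

    (* drop stop words, then tally and sort by descending frequency *)
    filtered = Select[words, ! MemberQ[stopwords, #] &];
    SortBy[Tally[filtered], -Last[#] &]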

The first step, then, is to use Mathematica to format word frequency information into Wordle.net-readable format (if we put all of the text into that website, it will crash, so we need to follow the instructions on that website); we can then draw up a word cloud of the entire collection. Here we can see both the advantages and disadvantages of static word clouds.


A Word Cloud of All Words in the CDC Corpus

What can we learn from this illustration? First, without knowing anything about the sites, we can immediately learn that the collection deals with Canada and Canadians, that it holds documents in both French and English, that we are dealing with digitized collections, and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, the word cloud leaves a severe amount of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but they need to be used carefully. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present." Despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency; a sketch of the n-gram step appears below. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters (such as 'bi'), two syllables, or, pertinent in our case, two words; an example is "canada is." A trigram is three words, a quadgram four words, a fivegram five words, and so on.
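Once the text is a list of words, counting n-grams is a one-line affair. A sketch, reusing the words list from the earlier fragment (n-grams keep stop words, so the unfiltered list is used); Partition's offset of one produces every overlapping n-gram:

    (* every overlapping n-gram of a word list, tallied and ranked *)
    ngrams[wordlist_List, n_Integer] :=
      SortBy[Tally[Partition[wordlist, n, 1]], -Last[#] &]

    ngrams[words, 3]  (* trigrams; bigrams with n = 2, and so on *)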


Word Cloud of the File ldquoPast and Presentrdquo

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

{"alberta", 758}, {"listen", 490}, {"settlers", 467}, {"new", 454}, {"canada", 388}, {"land", 362}, {"read", 328}, {"people", 295}, {"settlement", 276}, {"chinese", 268}, {"canadian", 264}, {"place", 255}, {"heritage", 254}, {"edmonton", 250}, {"years", 238}, {"west", 236}, {"raymond", 218}, {"ukrainian", 216}, {"came", 213}, {"calgary", 209}, {"names", 206}, {"ranch", 204}, {"updated", 193}, {"history", 190}, {"communities", 183}, …

Rather than having to visit the website, we can almost immediately see, at a glance, what the site is about. More topics become apparent when one looks at trigrams, as in this automatically generated output of top phrases:

{{"first", "people", "settlers"}, 100}, {{"the", "united", "states"}, 72}, {{"one", "of", "the"}, 59}, {{"of", "the", "west"}, 58}, {{"settlers", "last", "updated"}, 56}, {{"people", "settlers", "last"}, 56}, {{"albertans", "first", "people"}, 56}, {{"adventurous", "albertans", "first"}, 56}, {{"the", "bar", "u"}, 55}, {{"opening", "of", "the"}, 54}, {{"communities", "adventurous", "albertans"}, 54}, {{"new", "communities", "adventurous"}, 54}, {{"bar", "u", "ranch"}, 53}, {{"listen", "to", "the"}, 52}, …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to quickly home in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach; both are sketched below. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stop-word" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

{{"experience", "in", "the"}, 919}, {{"transferred", "to", "the"}, 918}, {{"funded", "by", "the"}, 918}, {{"of", "the", "provincial"}, 915}, {{"this", "web", "site"}, 915}, {{"409", "registry", "2"}, 914}, {{"are", "no", "longer"}, 914}, {{"gouvernement", "du", "canada"}, 910}, {{"use", "this", "link"}, 887}, {{"page", "or", "you"}, 887}, {{"following", "page", "or"}, 887}, {{"automatically", "transferred", "to"}, 887}, {{"be", "automatically", "transferred"}, 887}, {{"will", "be", "automatically"}, 887}, {{"industry", "you", "will"}, 887}, {{"multimedia", "industry", "you"}, 887}, {{"practical", "experience", "in"}, 887}, {{"gained", "practical", "experience"}, 887}, {{"they", "gained", "practical"}, 887}, {{"while", "they", "gained"}, 887}, {{"canadians", "while", "they"}, 887}, {{"young", "canadians", "while"}, 887}, {{"showcase", "of", "work"}, 887}, {{"excellent", "showcase", "of"}, 887}, …

Here we see an array of common language, funding information, link information, and project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down the list, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we can move into the unique or very rare phrases that appear only once, twice, or three times.
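Both ways of counting are easy to express. In the sketch below, sitewords is an assumed list of word lists, one per website; aggregate counts total appearances across the collection, while relative computes the share of sites containing a term at least once, matching the two options described above:

    (* total appearances of a term across every site *)
    aggregate[term_] := Total[Count[#, term] & /@ sitewords]

    (* fraction of sites in which the term appears at all *)
    relative[term_] :=
      N[Count[sitewords, w_ /; MemberQ[w, term]]/Length[sitewords]]

With the figures above, aggregate["aboriginal"] would return 457 and relative["aboriginal"] 0.1171875.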

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a computer science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.
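For those working with MALLET directly, the command-line sequence is brief. A sketch, assuming the collection's plaintext files sit in a hypothetical directory cdc-text and that twenty topics are wanted (the number of topics is the researcher's judgment call, not a given):

    bin/mallet import-dir --input cdc-text --output cdc.mallet \
        --keep-sequence --remove-stopwords
    bin/mallet train-topics --input cdc.mallet --num-topics 20 \
        --output-topic-keys cdc_keys.txt --output-doc-topics cdc_composition.txt

The first command turns the directory into MALLET's internal format; the second trains the model and writes the top words for each topic alongside each document's topical composition.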


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language), the stories (story, knight, archives, but also, interestingly, wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works: it is a proprietary black box. While the original algorithm is available, the actual methods that go into determining the ranking of search results are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above, we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
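Very little code lies behind such a chart. A sketch, assuming the frequency files have been read into a hypothetical list named wordcounts of {filename, {{word, count}, …}} pairs:

    (* occurrences of a term in one file's {word, count} table *)
    lookup[counts_, term_] := Total[Cases[counts, {term, n_} :> n]]

    (* one tooltipped bar per file for a single search term *)
    BarChart[Tooltip[lookup[Last[#], "freedom"], First[#]] & /@ wordcounts]

Hovering over a bar then reveals which file it represents, much as in the graph described above.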

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.


A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century and a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly identify critical sources.
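A keyword-in-context routine of this kind is also short. The sketch below is illustrative rather than the article's own code; it assumes a file's plaintext is held in a string named text and pulls a fixed window of characters around each match:

    (* up to 40 characters of context on either side of each hit *)
    kwic[text_String, term_String, window_: 40] :=
      StringTake[text, {Max[#[[1]] - window, 1],
          Min[#[[2]] + window, StringLength[text]]}] & /@
        StringPosition[text, term, IgnoreCase -> True]

    kwic[text, "ontario"] // Column

Each line of the resulting column shows one occurrence of "ontario" embedded in its surrounding prose.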

Another model through which we can visualize documents is a viewer primarily focusing on term frequency and inverse document frequency, terms discussed below. To prepare this material, we combine our earlier data about word frequency (that the word "experience" appears 919 times in a given file, perhaps) with aggregate information about frequency in the entire corpus (i.e., the word "experience" appears 1,766 times). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf times the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."
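Expressed in Mathematica, the weighting is a one-liner. The document counts below are illustrative: 919 appearances in one site is drawn from the example above, while a document frequency of 100 out of 384 sites is hypothetical:

    (* tf-idf: term frequency weighted by the rarity of the term across documents *)
    tfidf[tf_, df_, ndocs_] := tf*Log[ndocs/df]

    N[tfidf[919, 100, 384]]  (* weight of a term appearing in 100 of 384 sites *)

Rare words in a document thus receive high weights, while words spread evenly across the corpus fall toward zero.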


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program rather than quickly navigating between them. This stems from the massive number of files and amount of data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance, rather than processed on the spot, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need arises, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential of working with born-digital sources. It is intended to jumpstart a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. These courses should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources, and of attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software and databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information; this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of the various ways to openly disseminate material on the web, whether through blogging, micro-blogging, or publishing webpages; it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history, and the readiness to work with born-digital sources, cannot stem from pedagogy alone, however. Institutional support is necessary. While the evidence is anecdotal, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for instance, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article has argued, first, why digital methodologies are necessary and, second, how we can realize aspects of this call. The Internet is twenty years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996: Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates/works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline/44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.


14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients – 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to teach the basics of computational history. See The Programming Historian 2, 2012–Present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp, Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook (Penguin, 2010). For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online: www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can of course vary depending on the compression methods chosen. Their math was based on an average of 8MB per book, if it was scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82; published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC: May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Center for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Center for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: Temple University Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so that one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online: www.waybackarchive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online: www.web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collections Webpage, 24 November 2000, through Internet Archive, www.web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections Homepage," 21 November 2003, through Internet Archive, www.web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important News about Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive, www.web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 pm on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and "Emma" was the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.


48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 "Important Notices," Library and Archives Canada Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. These can usually be found on index or splash pages under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated late 2007MacBook Pro

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 24: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

quick visualization the circular cluster seen above lets us see theunique nature of this information We can go to each web pagescrape the links and see where they take viewers As we see in thebelow figure the CDC collection can be visualized as a series ofhyperlinks (each line) The cluster in the middle is the CDC collec-tion itself whereas the strands on the outside represent externalwebsites Using Mathematica I have generated a dynamic modelyou can move your mouse over each line learn what link it is andthus see where the collection fits into the broader Internet The siteis limited to following two links beyond the first page Through thisapproach we see the interconnection of knowledge and relationshipsbetween different sources At a glimpse we can see suggestions to fol-low other sites and determine if any nodes are shared amongst thevarious websites revealed

Visualization Moving Beyond Word Clouds

Where to begin with a large collection such as the Canadarsquos DigitalCollection corpus As a historian who is primarily interested in thewritten word my first step was to isolate the plaintext throughout thematerial so that I could begin rudimentary data mining and textualanalysis on the corpus For this I approached the CDC collection ontwo distinct levels mdash first the aggregate collection and second divid-ing the collection into each lsquofilersquo or individual website mdash derivingword and phrase data for each Imagine if you could look throughinto an archival collection and get a quick sense of what topics andideas were covered Taking that thought further imagine the utility ifyou could do this with individual archival boxes and even files Witha plethora of born-digital sources this is now a necessary step

On an aggregate level word and phrase data is a way to explorethe information within How often do various words within thenearly eight million words of the collection appear UsingMathematica we can quickly generate a list of words sorted by theirrelative frequency we learn that the word ldquoCanadardquo appears 26429times ldquocollectionsrdquo 11436 and ldquo[L]ouisbourgrdquo an astounding9365 times Reading such a list can quickly wear out a reader and isof minimal visual attractiveness Two quick decisions were made to

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

enhance data usability first the information was ldquocase-foldedrdquo allsent to lower-case this enables us to count the total appearances of aword such as ldquodigitalrdquo regardless of whether or not it appears at thebeginning of a sentence

Second when dealing with word frequency mdash as opposed tophrase frequency mdash ldquostop wordsrdquo should be removed as they werehere These are ldquoextremely common words which would appear to beof little value in helping select documents matching a user needrdquowords such as ldquotherdquo ldquoandrdquo ldquotordquo and so forth They matter forphrases (the example ldquoto be or not to berdquo is comprised entirely ofstop words) but will quickly dominate any word-level visualizationor list53 Other words may need to be added or removed from stoplists depending on the corpus for example if the phrase ldquopage num-berrdquo appeared on every page that would be useless information forthe researcher

The first step then is to use Mathematica to format word fre-quency information into Wordlenet readable format (if we put allof the text into that website it will crash mdash so we need to followthe instructions on that website) we can then draw up a wordcloud of the entire collection Here we can see both the advantagesand disadvantages of static word clouds What can we learn from this

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

A Word Cloud of All Words in the CDC Corpus

illustration First without knowing anything about the sites we canimmediately learn that it is dealing with CanadaCanadians that itis a collection with documents in French and English that we aredealing with digitized collections and that there is a fair amount ofregional representation not based upon population (note thatldquoontariordquo is less frequent than both ldquonewfoundlandrdquo ldquonova scotiardquoand ldquoalbertardquo although ldquoqueacutebecrdquo and ldquoquebecrdquo is not terribly promi-nent) Certain kinds of history dominate communities familynames the built form (houses churches schools and so forth) At avery macro level we can learn several characteristics before doing anydeeper investigation

These forms of visualization need to be used with extreme cau-tion however First while visually attractive it leaves a severeamount of ambiguity are ldquoontariordquo and ldquoqueacutebecrdquo equally promi-nent for example or is there a slight divergence Even moreproblematically we cannot learn context from the above picturedoes the word ldquoschoolrdquo refer to built structures or educational expe-riences The same question with applies to ldquochurchrdquo Do mentionsof ldquonewfoundlandrdquo refer to historical experiences there or withexpatriates or in travel narratives Word clouds are useful first stepsand have their place but need to be used with extreme caution Formore scholarly applications tables offer much more precise inter-pretation As the second figure below demonstrates they can bemore helpful when dealing with the individual sites cuttingthrough ambiguity The figure below is a visualization of the web-site ldquopast and presentrdquo Despite the ambiguous title we can quicklylearn what this website contains from this visualization Wordle isthus a useful first step analogous to taking magically taking a quicklook through the side of an archival box We still need more infor-mation however to really make this form of textual analysishelpful To navigate this problem I have created my own dynamicfinding aid program using Mathematica which scales to any born-digital web collection

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

Two approaches were taken First a textual representation of wordand phrase frequency allows us to determine the content of a given filerelatively quickly I wrote a short program to extract word frequency(stop words excluded) and phrase frequency For phrases we speak ofn-grams bigrams trigrams quadgrams and fivegrams although onecould theoretically do any higher number A bigram can be a combina-tion of two characters such as lsquobirsquo two syllables or mdash pertinent in ourcase mdash two words An example ldquocanada isrdquo A trigram is three wordsquadgram is four words and a fivegram is five words and so on

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Word Cloud of the File ldquoPast and Presentrdquo

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approachrsquos utility Usingmy Mathematica textual CDC browser pictured above I randomlyexplored file 267 the ambiguously named ldquopast and presentrdquo web-site This is one of the largest individual websites 122363 words orabout 493 standard typed pages Such an ambiguous name could beabout almost anything Initially I decided to study word frequency

ldquoldquoalbertardquo 758rdquo ldquoldquolistenrdquo 490rdquo ldquoldquosettlersrdquo 467rdquoldquoldquonewrdquo 454rdquo ldquoldquocanadardquo 388rdquo ldquoldquolandrdquo 362rdquo ldquoldquoreadrdquo328rdquo ldquoldquopeoplerdquo 295rdquo ldquoldquosettlementrdquo 276rdquo ldquoldquochineserdquo268rdquo ldquoldquocanadianrdquo 264rdquo ldquoldquoplacerdquo 255rdquo ldquoldquoheritagerdquo254rdquo ldquoldquoedmontonrdquo 250rdquo ldquoldquoyearsrdquo 238rdquo ldquoldquowestrdquo 236rdquoldquoldquoraymondrdquo 218rdquo ldquoldquoukrainianrdquo 216rdquo ldquoldquocamerdquo 213rdquoldquoldquocalgaryrdquo 209rdquo ldquoldquonamesrdquo 206rdquo ldquoldquoranchrdquo 204rdquoldquoldquoupdatedrdquo 193rdquo ldquoldquohistoryrdquo 190rdquo ldquoldquocommunitiesrdquo183rdquo hellip

Rather than having to visit the website at a glance we can almostimmediately see what the site is about More topics become apparentwhen one looks at trigrams for example in this automatically gen-erated output of generated top phrases

ldquoldquofirstrdquo ldquopeoplerdquo ldquosettlersrdquo 100rdquo ldquoldquotherdquo ldquounitedrdquoldquostatesrdquo 72rdquo ldquoldquoonerdquo ldquoofrdquo ldquotherdquo 59rdquo ldquoldquoofrdquo ldquotherdquoldquowestrdquo 58rdquo ldquoldquosettlersrdquo ldquolastrdquo ldquoupdatedrdquo 56rdquo ldquoldquopeoplerdquoldquosettlersrdquo ldquolastrdquo 56rdquo ldquoldquoalbertansrdquo ldquofirstrdquo ldquopeoplerdquo 56rdquoldquoldquoadventurousrdquo ldquoalbertansrdquo ldquofirstrdquo 56rdquo ldquoldquotherdquo ldquobarrdquoldquourdquo 55rdquo ldquoldquoopeningrdquo ldquoofrdquo ldquotherdquo 54rdquo ldquoldquocommunitiesrdquoldquoadventurousrdquo ldquoalbertansrdquo 54rdquo ldquoldquonewrdquo ldquocommunitiesrdquoldquoadventurousrdquo 54rdquo ldquoldquobarrdquo ldquourdquo ldquoranchrdquo 53rdquo ldquoldquolistenrdquoldquotordquo ldquotherdquo 52rdquo hellip

Through a few clicks we can navigate and determine the basiccontours mdash from a distance mdash of what the site pertains to In aviewer you can use a slider to dynamically see more informationor less information quickly toggling between bigrams trigramsand quadgrams allowing you to quickly hone in on immediately

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

relevant information This can be useful but we need to moveback even further to achieve a distant reading of this born-digitalcollection

If one zooms back out to this very distant level we can extractn-gram data from the entire collection We have two approachesfirst aggregate appearances of a given term (if ldquoaboriginalrdquo appears457 times then n=457) or second relative appearances of a giventerm (if ldquoaboriginalrdquo appears in 45 of 384 sites then n=45384 or01171875) In my other work on music lyrics the latter optionenables you to control for the lsquochorusrsquo effect as the former can be drowned out by one song with repetitive words Given thenature of this collection I elected for the former approach Whatcan we learn from trigrams for example After an initial set of trigrams pertaining to the Louisbourg digitization project whichhas boilerplate language on each page as well as common ldquostop-wordrdquo equivalents of phrases (especially pronounced in French) we come across the first set of meaningful phrases excerptedbelow

ldquoexperiencerdquo ldquoinrdquo ldquotherdquo 919 ldquotransferredrdquo ldquotordquo ldquotherdquo918 ldquofundedrdquo ldquobyrdquo ldquotherdquo 918 ldquoofrdquo ldquotherdquo ldquoprovincialrdquo915 ldquothisrdquo ldquowebrdquo ldquositerdquo 915 ldquo409rdquo ldquoregistryrdquo ldquo2rdquo914 ldquoarerdquo ldquonordquo ldquolongerrdquo 914 ldquogouvernementrdquo ldquodurdquo ldquocanadardquo 910 ldquouserdquo ldquothisrdquo ldquolinkrdquo 887ldquopagerdquo ldquoorrdquo ldquoyourdquo 887 ldquofollowingrdquo ldquopagerdquo ldquoorrdquo 887 ldquoautomaticallyrdquo ldquotransferredrdquo ldquotordquo 887 ldquoberdquoldquoautomaticallyrdquo ldquotransferredrdquo 887 ldquowillrdquo ldquoberdquo ldquoautomaticallyrdquo 887 ldquoindustryrdquo ldquoyourdquo ldquowillrdquo 887ldquomultimediardquo ldquoindustryrdquo ldquoyourdquo 887 ldquopracticalrdquo ldquoexperiencerdquo ldquoinrdquo 887 ldquogainedrdquo ldquopracticalrdquo ldquoexperiencerdquo887 ldquotheyrdquo ldquogainedrdquo ldquopracticalrdquo 887 ldquowhilerdquo ldquotheyrdquoldquogainedrdquo 887 ldquocanadiansrdquo ldquowhilerdquo ldquotheyrdquo 887ldquoyoungrdquo ldquocanadiansrdquo ldquowhilerdquo 887 ldquoshowcaserdquo ldquoofrdquoldquoworkrdquo 887 ldquoexcellentrdquo ldquoshowcaserdquo ldquoofrdquo 887 hellip

Here we see an array of common language funding informa-tion link information project information spread across over 887

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

websites Reading down we see information about governments jobexperience project information initiatives re-structuring and soforth At a glance we can see what language is common to all web-sites as opposed to unique information Moving down we find themore common but less universal ldquothe history ofrdquo for example ldquoher-itage community foundationrdquo libraries and so forth By the end ofthe list we can move into the unique or very rare phrases that appearonly once twice or three times

Moving beyond word and phrase frequency there are a fewmore sophisticated off-the-shelf options that can help us make senseof large arrays of unstructured data Topic modeling is among themost developed as Princetonrsquos David M Blei (a Computer Scienceprofessor) describes in a general introduction to the topic

While more and more texts are available online we simplydo not have the human power to read and study them toprovide the kind of [traditional] browsing experiencedescribed above To this end machine learning researchershave developed probabilistic topic modeling a suite of algo-rithms that aim to discover and annotate large archives ofdocuments with thematic information Topic modelingalgorithms are statistical methods that analyze the words ofthe original texts to discover the themes that run throughthem how those themes are connected to each other andhow they change over time hellip Topic modeling enables usto organize and summarize electronic archives at a scalethat would be impossible by human annotation55

Using the aforementioned SEASR Meandre workbench (or viaMALLET itself and a command-line interface) we can quickly beginto run our born-digital information through the topic-modelingalgorithm The results are notable and come in two formats a wordcloud visualization as well as an XML file containing a line-by-linebreakdown of each topic Taking the aforementioned file 267 (ldquopastand presentrdquo) we can begin to get a sense of the topics this large web-site discusses

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

At a glance we can learn about some of the main themes the struc-ture of Alberta (Alberta towns bridges lakes) the geographic shapeof the area (life land lands areas links grass and interestingly lan-guage) the stories (story knight archives but also interestinglyknight and wife) and so forth The word cloud gives us a sense ofthe most relevant terms that define each topic but we also have a3200-line XML file broken into topics which lets us further refineour inquiry

A last set of examples will help us move from the aggregate levelof hundreds of thousands of lines of text to the individual line inessence creating our own search engine to navigate this collection

There are a few good reasons why we should not use Google todo all of this in our own work First we do not know how Googleworks precisely It is a proprietary black box While the original algo-rithm is available the actual methods that go into determining theranking of search terms is a trade secret Second Google is optimizedfor a number of uses mainly to deliver the most relevant informa-tion to the user Historians and archivists are looking often forspecific information within old websites and it behooves us to exam-ine them through the lens of more specialized tools With thesehesitations in mind I have sought to develop my own rudimentarysearch engine for this born-digital collection


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words “freedom,” “black,” and “liberty” (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above, we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
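A rough Mathematica approximation of this dynamic finding aid is sketched below; the file names, directory, and tokenization are assumptions rather than the tool’s actual implementation. Wrapping each count in Tooltip is what produces the mouse-over frequency display described above.

    (* sketch: tally three search terms in every file and chart the results *)
    files = FileNames["*.txt", "cdc-collection"];      (* assumed layout *)
    terms = {"freedom", "black", "liberty"};
    tokens[file_] := StringSplit[ToLowerCase[Import[file, "Text"]],
      Except[WordCharacter] ..];
    toks = AssociationMap[tokens, files];              (* tokenize each file once *)
    counts = Table[Count[toks[f], t], {f, files}, {t, terms}];
    BarChart[Map[Tooltip, counts, {2}],
      ChartLabels -> {FileBaseName /@ files, None}, ChartLegends -> terms]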

We can then move down into the data. In the example above, “047,” labeled “blackgold.txt,” contains 31 occurrences of the word “ontario.” Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.


[Figure: A snapshot of the dynamic finding aid tool]

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly identify critical sources.
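The keyword-in-context step itself can be sketched in a few lines of Mathematica. This is an illustration rather than the routine used here; the file name and the five-word window are assumptions.

    (* sketch of a keyword-in-context (KWIC) display *)
    kwic[file_String, term_String, window_Integer: 5] := Module[
      {words, positions},
      words = StringSplit[ToLowerCase[Import[file, "Text"]],
        Except[WordCharacter] ..];
      positions = Flatten[Position[words, term]];
      Column[StringRiffle[
          words[[Max[1, # - window] ;; Min[Length[words], # + window]]]] & /@
        positions]]

    kwic["blackgold.txt", "ontario"]

Each hit becomes a one-line snippet of the term with its surrounding words, which is what makes it possible to skim dozens of occurrences without opening the site itself.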

Another model through which we can visualize documents is through a viewer primarily focusing on term frequency and inverse document frequency, terms discussed below. To prepare this material, we combine our earlier data about word frequency within a single site (that the word “experience” appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e., the word “experience” appears 1,766 times overall). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, through which we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf multiplied by the logarithm of the total number of documents divided by the number of documents in which the given term appears (tf × log(N/df)). At a glance, then, we can see some of the unique words contained within a website if we select “inverse document frequency.”
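Expressed in code, the whole calculation is brief. The sketch below (again in Mathematica, reusing the assumed tokens helper from the earlier example) scores one document’s words against the corpus and returns the ten most distinctive.

    (* sketch: tf-idf scores for one document within a corpus *)
    docs = tokens /@ FileNames["*.txt", "cdc-collection"];
    n = Length[docs];
    df = Counts[Join @@ Map[DeleteDuplicates, docs]];  (* documents per word *)
    tfidf[doc_] := With[{tf = Counts[doc]},
      AssociationMap[N[tf[#] Log[n/df[#]]] &, Keys[tf]]];
    TakeLargest[tfidf[First[docs]], 10]  (* most distinctive words *)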


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted to queues in advance rather than run interactively, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need arises, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin, or jumpstart, a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, or at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. These should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before. How have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital repositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web: whether through blogging, micro-blogging, or publishing webpages, it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history and the readiness to work with born-digital sources cannot stem from pedagogy alone, however. Institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a “digital humanities” option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for instance, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article has argued, first, why digital methodologies are necessary and, second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu’un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l’histoire des travailleurs.

Endnotes

1 The question of when the past becomes a “history” to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates/works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See “IdleNoMore Timeline,” ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on “Collecting History Online.” Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, “History and the Second Decade of the Web,” Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, “Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day,” ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user’s following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, “This is What a Tweet Looks Like,” ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, “How Computation Changes Research,” in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, “History and the Social Sciences: The Longue Durée,” in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at “Text Analysis,” Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science 331 (14 January 2011): 176–82. The tool is publicly accessible at “Google Books Ngram Viewer,” Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, “Loneliness and Freedom,” Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, “Loneliness and Freedom,” Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.

14 “With Criminal Intent,” CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see “Award Recipients – 2011,” Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to learn the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, “Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective,” Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, “History and Computing,” Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, “Text and Document Visualization,” The Humanities and Technology Camp: Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook (Penguin, 2010). For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, “Cisco CRS-3: Download Library of Congress in Just Over One Second,” Mashable (9 March 2010), accessed online: www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a “byte” (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can of course vary depending on compression methods chosen. Their math was based on an average of 8MB per book, if it was scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, “Transferring ‘Libraries of Congress’ of Data,” Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 “About the Library of Congress Web Archives,” Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 “10,000,000,000,000,000 Bytes Archived,” Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82, published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, “Google’s Piracy Liability,” Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 “Government of Canada Web Archive,” Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, “Visiting Library and Archives in Ottawa? Not Without an Appointment,” Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, “LAC Begins Implementation of New Approach to Service Delivery,” Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the “CHA’s comment on New Directions for Library and Archives Canada,” CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and “Save Library & Archives Canada,” Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, “The Smokescreen of ‘Modernization’ at Library and Archives Canada,” ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 “Save LAC: May 2012 Campaign Update,” SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, “The Future of Preserving the Past,” CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see “September 11 Digital Archive: Saving the Histories of September 11, 2001,” Centre for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; “Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita,” Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and “Occupy Archive: Archiving the Occupy Movements from 2011,” Centre for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on “Preserving Digital History.” The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see “Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress,” Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, “No Way to Run a Culture,” Wired, 13 February 1998, available online via the Internet Archive: www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, “WARC Files: A Challenge for Historians, and Finding Needles in Haystacks,” ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, “Digitization: A Cautionary Tale,” New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, “Ocropodium: Open Source OCR for Small-Scale Historical Archives,” Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, “Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions,” Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, “Canada’s Digital Collections: Sharing the Canadian Identity on the Internet,” The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 “WayBackMachine Beta: www.collections.ic.gc.ca,” Internet Archive, available online: www.waybackarchive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 “Street Youth Go Online,” news release from Industry Canada, Internet Archive, 15 November 1996, available online: www.web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, “Projects for Libraries and Archives,” Canada’s Digital Collection Webpage, 24 November 2000, through Internet Archive: www.web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 “Canada’s Digital Collections homepage,” 21 November 2003, through Internet Archive: www.web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 “Important news about Canada’s Digital Collections,” Canada’s Digital Collections Webpage, 18 November 2006, through Internet Archive: www.web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see “Wolfram Mathematica 9,” Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel’s demonstration of “Term Weighting with TF-IDF,” drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 pm on 18 December 1983; “Edith” was more popular than Emma until around 1910, and now “Emma” is the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as “Voyeur Tools” before a 2011 name change. It is available online at www.voyant-tools.org.

48 For more information, see “Software Environment for the Advancement of Scholarly Research (SEASR),” www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see “Topic Modeling,” MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see “Analytics: Dunning’s Log-Likelihood Statistic,” MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 “Important Notices,” Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. They can usually be found on index or splash pages under headings as varied as “Terms of Service,” “Permissions,” “Important Notices,” or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie, as there is a “splash redirect” page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54 David M. Blei, “Introduction to Probabilistic Topic Models,” www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated late-2007 MacBook Pro.

56 See John Bonnett, “High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada,” Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, “High Performance Computing in the Arts and Humanities,” SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 “SHARCNET: History of SHARCNET,” SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 “Priority Areas,” SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, “Is Digital Humanities a Field of Research?” Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.




MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

Blog 11 July 2011wwwblogslocgovdigitalpreservation201107transferring-libraries-of-congress-of-data ltviewed 11 January 2013gt

24 Gleick 232 25 ldquoAbout the Library of Congress Web Archivesrdquo Library of Congress Web

Archiving FAQs Undated wwwlocgovwebarchivingfaqhtmlfaqs_02ltviewed 11 January 2013gt

26 ldquo10000000000000000 Bytes Archivedrdquo Internet Archive Blog 26October 2012 wwwblogarchiveorg2012102610000000000000000-bytes-archivedltviewed 19 December 2012gt

27 Jean-Baptiste Michel et al 176ndash82 Published Online Ahead of Print16 December 2010 wwwsciencemagorgcontent3316014176ltviewed 29 July 2011gt See also Scott Cleland ldquoGooglersquos PiracyLiabilityrdquo Forbes 9 November 2011 wwwforbescomsitesscottcle-land20111109googles-piracy-liability ltviewed 11 January 2013gt

28 ldquoGovernment of Canada Web Archiverdquo Library and Archives Canada 17October 2007 wwwcollectionscanadagccawebarchivesindex-ehtmlltviewed 11 January 2013gt

29 For background see Bill Curry ldquoVisiting Library and Archives inOttawa Not Without an Appointmentrdquo Globe and Mail 1 May 2012wwwtheglobeandmailcomnewspoliticsottawa-notebookvisiting-library-and-archives-in-ottawa-not-without-an-appointmentarticle2418960 ltviewed 22 May 2012gt and the announcement ldquoLAC Begins Implementation of New Approach to service Deliveryrdquo Library and Archives Canada February 2012wwwcollectionscanadagccawhats-new013-560-ehtml ltviewed 22May 2012gt For professional responses see the ldquoCHArsquos comment onNew Directions for Library and Archives Canadardquo CHA Website March2012 wwwcha-shccaenNews_39items19html ltviewed 19December 2012gt and ldquoSave Library amp Archives Canadardquo CanadianAssociation of University Teachers launched AprilMay 2012 wwwsavelibraryarchivesca ltviewed 19 December 2012gt

30 Ian Milligan ldquoThe Smokescreen of lsquoModernizationrsquo at Library andArchives Canadardquo ActiveHistoryca 22 May 2012wwwactivehistoryca201205the-smokescreen-of-modernization-at-library-and-archives-canada ltviewed 19 December 2012gt

31 ldquoSave LAC May 2012 Campaign Updaterdquo SaveLibraryArchivesca May2012 wwwsavelibraryarchivescaupdate-2012-05aspx ltviewed 19December 2012gt

32 The challenges and opportunities presented by such projects is extensively discussed in Daniel J Cohen ldquoThe Future of Preserving the Pastrdquo CRM The Journal of Heritage Stewardship 2 no 2

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

(Summer 2005) 6ndash19 also available freely online at the RoyRosenzweig Center for History and New Mediawwwchnmgmueduessays-on-history-new-mediaessaysessayid=39For more see ldquoSeptember 11 Digital Archive Saving the Histories of September 11 2001rdquo Centre for History and New Media andAmerican Social History Project Center for Media and Learningwww911digitalarchiveorg ltviewed 11 January 2013gt ldquoHurricaneDigital Memory Bank Collecting and Preserving the Stories of Katrina and Ritardquo Center for History and New Media wwwhurricanearchiveorg ltviewed 11 January 2013gt and ldquoOccupyArchive Archiving the Occupy Movements from 2011rdquo Centre forHistory and New Media wwwoccupyarchiveorg ltviewed 11 January2013gt

33 Daniel Cohen and Roy Rosenzweig Digital History A Guide to Gathering Preserving and Presenting the Past on the Web(Philadelphia Temple University Press 2005) chapter on ldquoPreservingDigital Historyrdquo The chapter is available fully online atwwwchnmgmuedudigitalhistorypreserving5php ltviewed 19December 2012gt

34 For an overview see ldquoPreserving Our Digital Heritage The NationalDigital Information Infrastructure and Preservation Program 2010 Report a Collaborative Initiative of the Library of CongressrdquoDigitalpreservationgov wwwdigitalpreservationgovmultimediadocumentsNDIIPP2010Report_Postpdf ltviewed 19 December 2012gt

35 Steve Meloan ldquoNo Way to Run a Culturerdquo Wired 13 February 1998available online via the Internet Archivewwwwebarchiveorgweb20000619001705httpwwwwiredcomnewsculture012841030100html ltviewed 19 December 2012gt

36 Some of this has been discussed in greater depth on my website See theseries beginning with Ian Milligan ldquoWARC Files A Challenge forHistorians and Finding Needles in Haystacksrdquo ianmilliganca 12December 2012 wwwianmilliganca20121212warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks ltviewed 20 December2012gt

37 Unstructured data as opposed to structured data is important for histo-rians but brings additional challenges Some projects have producedstructured data novels marked up using XML encoding for example soone can immediately see place names dates character names and soforth This makes it much easier for a computer to read and understand

38 See William Noblett ldquoDigitization A Cautionary Talerdquo New Review ofAcademic Librarianship 17 (2011) 3 Tobias Blanke Michael Bryantand Mark Hedges ldquoOcropodium Open Source OCR for Small-ScaleHistorical Archivesrdquo Journal of Information Science 38 no 1 (2012) 83

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

and Donald S Macqueen ldquoDeveloping Methods for Very-Large-ScaleSearches in Proquest Historical Newspapers Collections and InfotracThe Times Digital Archive The Case of Two Million versus TwoMillionsrdquo Journal of English Linguistics 32 no 2 (June 2004) 127

39 Elizabeth Krug ldquoCanadarsquos Digital Collections Sharing the CanadianIdentity on the Internetrdquo The Archivist [Library and Archives Canadapublication] April 2000 wwwcollectionscanadagccapublicationsarchivistmagazine015002-2170-ehtml ltviewed 11 January 2013gt

40 ldquoWayBackMachine Beta wwwcollectionsicgccardquo Internet ArchiveAvailable onlinewwwwaybackarchiveorgwebhttpcollectionsicgcca ltviewed 11January 2013gt

41 ldquoStreet Youth Go Onlinerdquo News Release from Industry Canada InternetArchive 15 November 1996 Available onlinewwwwebarchiveorgweb20001011131150httpinfoicgccacmbWelcomeicnsfffc979db07de58e6852564e400603639514a8d746a34803385256612004d90b3OpenDocument ltviewed 11 January 2013gt

42 Darlene Fichter ldquoProjects for Libraries and Archivesrdquo Canadarsquos DigitalCollection Webpage 24 November 2000 through Internet Archivewwwwebarchiveorgweb200012091312httpcollectionsicgccaEcdc_projhtm ltviewed 18 June 2012gt

43 ldquoCanadarsquos Digital Collections homepagerdquo 21 November 2003 through Internet Archive wwwwebarchiveorgweb20031121213939httpcollectionsicgccaEindexphpvo=15 ltviewed 18 June 2012gt

44 ldquoImportant news about Canadarsquos Digital Collectionsrdquo Canadarsquos DigitalCollections Webpage 18 November 2006gt through Internet Archivewwwwebarchiveorgweb20061118125255httpcollectionsicgccaEviewhtml ltviewed 18 June 2012gt

45 For more information see the information at ldquoWolfram Mathematica9rdquo Wolfram Research wwwwolframcommathematica ltviewed 11January 2013gt An example of work in Mathematica can be seen inWilliam J Turkelrsquos demonstration of ldquoTerm Weighting with TF-IDFrdquodrawing on Old Bailey proceedings It is available at wwwdemonstrationswolframcomTermWeightingWithTFIDFltviewed 11 January 2013gt

46 It was 8 degrees Celsius at 500 pm on 18 December 1983 ldquoEdithrdquowas more popular than Emma until around 1910 and now ldquoEmmardquo isthe third most popular name in the United States for new births in2010 the Canadian unemployment rate in March 1985 was 106 per-cent 25th highest in the world and the Canadian GDP in December1995 was $6039 billion per year 10th largest in the world

47 It was formerly known as ldquoVoyeur Toolsrdquo before a 2011 name change Itis available online at wwwvoyant-toolsorg

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

48 For more information see ldquoSoftware Environment for the Advancementof Scholarly Research (SEASR)rdquo wwwseasrorg ltviewed 18 June2012gt

49 The MALLET toolkit can also be run independently of SEASR For more on topic modeling specifically see ldquoTopic Modelingrdquo MALLET Machine Learning for Language Toolkitwwwmalletcsumassedutopicsphp ltviewed 18 June 2012gt

50 For a good overview see ldquoAnalytics Dunningrsquos Log-Likelihood StatisticrdquoMONK Metadata Offer New Knowledge wwwgautamlisillinoisedumonkmiddlewarepublicanalyticsdunningshtml ltviewed 18 June2012gt

51 ldquoImportant Noticesrdquo Library and Archives Website last updated 30 July2010 wwwcollectionscanadagccanoticesindex-ehtml ltviewed 18June 2012gt Historians will have to increasingly learn to navigate theTerms of Services (TOS) of born-digital source repositories They can beusually found on index or splash pages under headings as varied asldquoTerms of Servicerdquo ldquoPermissionsrdquo ldquoImportant Noticesrdquo or others

52 The command was wget mdashm mdash-cookies=on mdash-keep-session-cookies mdash-load-cookies=cookietxt mdash-referrer=httpepelac-bacgcca100205301iccdccensusdefaulthtm mdashrandom-wait mdashwait=1 mdashlimit-rate=30K httpepelac-bacgcca100205301iccdcEAlphabetasp This loadeda previous cookie load as there is a ldquosplash redirectrdquo page warning aboutdysfunctional archived content It also imposed a random wait andbandwidth cap to be a good netizen

53 For a discussion of this see Christopher D Manning PrabhakarRaghavan Hinrich Scuumltze An Introduction to Information Retrieval(Cambridge UK Cambridge University Press 2009) 27ndash8 Availableonline at wwwnlpstanfordeduIR-book

54 David M Blei ldquoIntroduction to Probabilistic Topic Modelsrdquowwwcsprincetonedu~bleipapersBlei2011pdf ltviewed 18 June2012gt

55 The research was actually carried out on a rather dated late 2007MacBook Pro

56 See John Bonnett ldquoHigh-Performance Computing An Agenda for theSocial Sciences and the Humanities in Canadardquo Digital Studies 1 no 2(2009) wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt and John Bonnett Geoffrey Rockwell and KyleKuchmey ldquoHigh Performance Computing in the arts and HumanitiesrdquoSHARCNET Shared Hierarchical Academic Research Computing Network2006 wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt

57 ldquoSHARCNET History of SHARCNETrdquo SHARCNET 2009

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

wwwsharcnetcamyabouthistory ltviewed 18 December 2012gt 58 ldquoPriority Areasrdquo SSHRC last modified 4 November 2011

wwwsshrc-crshgccafunding-financementprograms-programmespriority_areas-domaines_prioritairesindex-engaspx ltviewed 18 June2012gt

59 Eloquently discussed in Adam Crymble ldquoIs Digital Humanities a Fieldof Researchrdquo Thoughts on Digital amp Public History 11 August 2011wwwadamcrymbleblogspotcom201108is-digital-humanities-field-of-researchhtml ltviewed 18 June 2012gt

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 26: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

illustration. First, without knowing anything about the sites, we can immediately learn that the collection deals with Canada/Canadians; that it is a collection with documents in French and English; that we are dealing with digitized collections; and that there is a fair amount of regional representation not based upon population (note that "ontario" is less frequent than "newfoundland," "nova scotia," and "alberta," although "québec" and "quebec" are not terribly prominent). Certain kinds of history dominate: communities, family names, the built form (houses, churches, schools, and so forth). At a very macro level, we can learn several characteristics before doing any deeper investigation.

These forms of visualization need to be used with extreme caution, however. First, while visually attractive, they leave a great deal of ambiguity: are "ontario" and "québec" equally prominent, for example, or is there a slight divergence? Even more problematically, we cannot learn context from the above picture: does the word "school" refer to built structures or to educational experiences? The same question applies to "church." Do mentions of "newfoundland" refer to historical experiences there, or with expatriates, or in travel narratives? Word clouds are useful first steps and have their place, but they must be handled carefully. For more scholarly applications, tables offer much more precise interpretation. As the second figure below demonstrates, they can be more helpful when dealing with the individual sites, cutting through ambiguity. The figure below is a visualization of the website "past and present." Despite the ambiguous title, we can quickly learn what this website contains from this visualization. Wordle is thus a useful first step, analogous to magically taking a quick look through the side of an archival box. We still need more information, however, to really make this form of textual analysis helpful. To navigate this problem, I have created my own dynamic finding aid program using Mathematica, which scales to any born-digital web collection.


Two approaches were taken. First, a textual representation of word and phrase frequency allows us to determine the content of a given file relatively quickly. I wrote a short program to extract word frequency (stop words excluded) and phrase frequency. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically do any higher number. A bigram can be a combination of two characters (such as 'bi'), two syllables, or, pertinent in our case, two words. An example: "canada is." A trigram is three words, a quadgram is four words, a fivegram is five words, and so on.
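To make the procedure concrete, the following is a minimal sketch of how such counts can be produced in Mathematica. It illustrates the technique rather than reproducing the exact program used here; the file name and the tiny stop-word list are hypothetical.

    (* load one site's text, lowercase it, and split it into words *)
    text = Import["site267.txt", "Text"];
    words = ToLowerCase /@ StringCases[text, WordCharacter ..];

    (* drop a small, purely illustrative stop-word list *)
    stopwords = {"the", "of", "and", "a", "to", "in", "is"};
    kept = Select[words, ! MemberQ[stopwords, #] &];

    (* word frequencies, most common first *)
    wordFreq = Reverse[SortBy[Tally[kept], Last]];

    (* n-grams are simply overlapping runs of n consecutive words *)
    ngrams[list_, n_] := Partition[list, n, 1];
    trigramFreq = Reverse[SortBy[Tally[ngrams[words, 3]], Last]];

The first elements of wordFreq and trigramFreq are lists of exactly the kind excerpted below.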


Word Cloud of the File "Past and Present"

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approach's utility. Using my Mathematica textual CDC browser, pictured above, I randomly explored file 267, the ambiguously named "past and present" website. This is one of the largest individual websites: 122,363 words, or about 493 standard typed pages. Such an ambiguous name could be about almost anything. Initially, I decided to study word frequency:

"alberta" 758, "listen" 490, "settlers" 467, "new" 454, "canada" 388, "land" 362, "read" 328, "people" 295, "settlement" 276, "chinese" 268, "canadian" 264, "place" 255, "heritage" 254, "edmonton" 250, "years" 238, "west" 236, "raymond" 218, "ukrainian" 216, "came" 213, "calgary" 209, "names" 206, "ranch" 204, "updated" 193, "history" 190, "communities" 183 …

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example, in this automatically generated output of top phrases:

"first" "people" "settlers" 100, "the" "united" "states" 72, "one" "of" "the" 59, "of" "the" "west" 58, "settlers" "last" "updated" 56, "people" "settlers" "last" 56, "albertans" "first" "people" 56, "adventurous" "albertans" "first" 56, "the" "bar" "u" 55, "opening" "of" "the" 54, "communities" "adventurous" "albertans" 54, "new" "communities" "adventurous" 54, "bar" "u" "ranch" 53, "listen" "to" "the" 52 …

Through a few clicks, we can navigate and determine the basic contours, from a distance, of what the site pertains to. In a viewer, you can use a slider to dynamically see more or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to quickly home in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or roughly 0.117). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stopword" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

"experience" "in" "the" 919, "transferred" "to" "the" 918, "funded" "by" "the" 918, "of" "the" "provincial" 915, "this" "web" "site" 915, "409" "registry" "2" 914, "are" "no" "longer" 914, "gouvernement" "du" "canada" 910, "use" "this" "link" 887, "page" "or" "you" 887, "following" "page" "or" 887, "automatically" "transferred" "to" 887, "be" "automatically" "transferred" 887, "will" "be" "automatically" 887, "industry" "you" "will" 887, "multimedia" "industry" "you" 887, "practical" "experience" "in" 887, "gained" "practical" "experience" 887, "they" "gained" "practical" 887, "while" "they" "gained" 887, "canadians" "while" "they" 887, "young" "canadians" "while" 887, "showcase" "of" "work" 887, "excellent" "showcase" "of" 887 …

Here we see an array of common language: funding information, link information, and project information spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we move into the unique or very rare phrases that appear only once, twice, or three times.

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a computer science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time. … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.
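For readers who would rather work outside of SEASR, MALLET's own command-line interface follows the same two steps, importing the text and then training the model. The directory and file names below are hypothetical; the flags are those documented for MALLET 2.0.

    # turn a directory of plain-text files into MALLET's internal format
    bin/mallet import-dir --input cdc-text/ --output cdc.mallet \
        --keep-sequence --remove-stopwords

    # train a 20-topic model; write the top words per topic and the
    # per-document topic proportions to plain-text files
    bin/mallet train-topics --input cdc.mallet --num-topics 20 \
        --output-topic-keys topic-keys.txt --output-doc-topics doc-topics.txt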


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works. It is a proprietary black box: while the original algorithm is available, the actual methods that go into determining the ranking of search results are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine those sites through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above, we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
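A hedged sketch of the underlying routine: it assumes each site's word counts were previously exported to a two-column text file of word/count pairs (the directory layout and file names are hypothetical).

    terms = {"freedom", "black", "liberty"};
    freqFiles = FileNames["*.txt", "cdc-frequencies"];

    (* read one exported file as {{word, count}, ...} and pull out a term's count *)
    termCount[file_, term_] :=
      Total[Cases[Import[file, "Table"], {term, n_} :> n]];

    (* one bar per file: total hits on the three search terms *)
    hits = Table[Total[termCount[f, #] & /@ terms], {f, freqFiles}];
    BarChart[hits, ChartLabels -> Placed[FileBaseName /@ freqFiles, Below]]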

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.
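The keyword-in-context idea itself can be approximated in a couple of lines; this sketch simply shows up to forty characters on either side of each case-insensitive match of a term.

    kwic[text_String, term_String] :=
      StringCases[text, RegularExpression["(?i).{0,40}" <> term <> ".{0,40}"]];

    (* every appearance of "ontario" in the (hypothetically named) site file *)
    Column[kwic[Import["blackgold.txt", "Text"], "ontario"]]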


A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly identify critical sources.

Another model through which we can visualize documents is a viewer primarily focusing on term frequency and inverse document frequency, discussed below. To prepare this material, we combine our earlier data about word frequency (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e., the word "experience" appears 1,766 times overall). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn which words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf multiplied by the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."
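In symbols, the weight of a term in a site is tf × log(N/df), where tf is the term's count in that site, N is the number of documents in the collection, and df is the number of documents containing the term. A minimal sketch, with hypothetical sample numbers:

    (* tf-idf weight for one term in one document *)
    tfidf[tf_, df_, nDocs_] := tf*Log[nDocs/df];

    tfidf[919, 25, 384] // N  (* a term frequent here but rare elsewhere scores high *)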


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance rather than processed on-site, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need rears itself, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin or jumpstart a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information contained across over 73,000 files (as in the CDC collection) can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. Such courses should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and of attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before. How have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software and databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital repositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web, whether through blogging, micro-blogging, publishing webpages, and so forth; it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history and the readiness to work with born-digital sources cannot stem from pedagogy alone, however. Institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for instance, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary and, second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates/works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.


14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients – 2011," Digging into Data webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to teach the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp, Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook: Penguin, 2010. For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online, www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can of course vary, depending on the compression methods chosen. Their math was based on an average of 8MB per book, if it was scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82; published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC: May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Centre for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Centre for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: Temple University Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online, www.waybackarchive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," news release from Industry Canada, Internet Archive, 15 November 1996, available online, www.web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collection webpage, 24 November 2000, through Internet Archive, www.web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections homepage," 21 November 2003, through Internet Archive, www.web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important news about Canada's Digital Collections," Canada's Digital Collections webpage, 18 November 2006, through Internet Archive, www.web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 p.m. on 18 December 1983; "Edith" was more popular than Emma until around 1910, and now "Emma" is the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.

48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 "Important Notices," Library and Archives website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. These can usually be found on index or splash pages under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie file, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated, late-2007 MacBook Pro.

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 27: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

Two approaches were taken First a textual representation of wordand phrase frequency allows us to determine the content of a given filerelatively quickly I wrote a short program to extract word frequency(stop words excluded) and phrase frequency For phrases we speak ofn-grams bigrams trigrams quadgrams and fivegrams although onecould theoretically do any higher number A bigram can be a combina-tion of two characters such as lsquobirsquo two syllables or mdash pertinent in ourcase mdash two words An example ldquocanada isrdquo A trigram is three wordsquadgram is four words and a fivegram is five words and so on

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Word Cloud of the File ldquoPast and Presentrdquo

A viewer for n-grams in the CDC Collection

The CDC collection demonstrates this approachrsquos utility Usingmy Mathematica textual CDC browser pictured above I randomlyexplored file 267 the ambiguously named ldquopast and presentrdquo web-site This is one of the largest individual websites 122363 words orabout 493 standard typed pages Such an ambiguous name could beabout almost anything Initially I decided to study word frequency

ldquoldquoalbertardquo 758rdquo ldquoldquolistenrdquo 490rdquo ldquoldquosettlersrdquo 467rdquoldquoldquonewrdquo 454rdquo ldquoldquocanadardquo 388rdquo ldquoldquolandrdquo 362rdquo ldquoldquoreadrdquo328rdquo ldquoldquopeoplerdquo 295rdquo ldquoldquosettlementrdquo 276rdquo ldquoldquochineserdquo268rdquo ldquoldquocanadianrdquo 264rdquo ldquoldquoplacerdquo 255rdquo ldquoldquoheritagerdquo254rdquo ldquoldquoedmontonrdquo 250rdquo ldquoldquoyearsrdquo 238rdquo ldquoldquowestrdquo 236rdquoldquoldquoraymondrdquo 218rdquo ldquoldquoukrainianrdquo 216rdquo ldquoldquocamerdquo 213rdquoldquoldquocalgaryrdquo 209rdquo ldquoldquonamesrdquo 206rdquo ldquoldquoranchrdquo 204rdquoldquoldquoupdatedrdquo 193rdquo ldquoldquohistoryrdquo 190rdquo ldquoldquocommunitiesrdquo183rdquo hellip

Rather than having to visit the website, at a glance we can almost immediately see what the site is about. More topics become apparent when one looks at trigrams, for example in this automatically generated output of top phrases:

{{"first", "people", "settlers"}, 100}, {{"the", "united", "states"}, 72}, {{"one", "of", "the"}, 59}, {{"of", "the", "west"}, 58}, {{"settlers", "last", "updated"}, 56}, {{"people", "settlers", "last"}, 56}, {{"albertans", "first", "people"}, 56}, {{"adventurous", "albertans", "first"}, 56}, {{"the", "bar", "u"}, 55}, {{"opening", "of", "the"}, 54}, {{"communities", "adventurous", "albertans"}, 54}, {{"new", "communities", "adventurous"}, 54}, {{"bar", "u", "ranch"}, 53}, {{"listen", "to", "the"}, 52}, …

Through a few clicks, we can navigate and determine the basic contours – from a distance – of what the site pertains to. In a viewer, you can use a slider to dynamically see more information or less information, quickly toggling between bigrams, trigrams, and quadgrams, allowing you to quickly hone in on immediately relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stopword" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below.

{{"experience", "in", "the"}, 919}, {{"transferred", "to", "the"}, 918}, {{"funded", "by", "the"}, 918}, {{"of", "the", "provincial"}, 915}, {{"this", "web", "site"}, 915}, {{"409", "registry", "2"}, 914}, {{"are", "no", "longer"}, 914}, {{"gouvernement", "du", "canada"}, 910}, {{"use", "this", "link"}, 887}, {{"page", "or", "you"}, 887}, {{"following", "page", "or"}, 887}, {{"automatically", "transferred", "to"}, 887}, {{"be", "automatically", "transferred"}, 887}, {{"will", "be", "automatically"}, 887}, {{"industry", "you", "will"}, 887}, {{"multimedia", "industry", "you"}, 887}, {{"practical", "experience", "in"}, 887}, {{"gained", "practical", "experience"}, 887}, {{"they", "gained", "practical"}, 887}, {{"while", "they", "gained"}, 887}, {{"canadians", "while", "they"}, 887}, {{"young", "canadians", "while"}, 887}, {{"showcase", "of", "work"}, 887}, {{"excellent", "showcase", "of"}, 887}, …

Here we see an array of common language – funding information, link information, project information – spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we can move into the unique or very rare phrases that appear only once, twice, or three times.
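The two counting strategies described above can be sketched along these lines, again in Mathematica; the directory layout (one plain-text file per site) is an assumption for illustration:

    (* load every site in the collection as its own list of words *)
    files = FileNames["cdc/*.txt"];
    sites = Map[StringSplit[ToLowerCase[Import[#, "Text"]],
        Except[WordCharacter] ..] &, files];

    (* aggregate appearances: total occurrences across all sites *)
    aggregate[term_] := Total[Map[Count[#, term] &, sites]];

    (* relative appearances: the fraction of sites containing the term *)
    relative[term_] := N[Count[sites, s_ /; MemberQ[s, term]]/Length[sites]];

    aggregate["aboriginal"]  (* e.g., 457 *)
    relative["aboriginal"]   (* e.g., 45/384, or about 0.117 *)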

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.
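For the command-line route, a MALLET session follows the pattern sketched below; the directory names and the choice of twenty topics are assumptions for illustration, not the settings used on the CDC collection:

    # import a directory of plain-text files into MALLET's internal format
    bin/mallet import-dir --input cdc-txt/ --output cdc.mallet \
        --keep-sequence --remove-stopwords

    # train a topic model; write the topic keys and per-document topic weights
    bin/mallet train-topics --input cdc.mallet --num-topics 20 \
        --output-topic-keys cdc-keys.txt --output-doc-topics cdc-topics.txt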


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass, and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works: it is a proprietary black box. While the original algorithm is available, the actual methods that go into determining the ranking of search terms are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above, we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
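A rudimentary version of such a search might be sketched as follows; the frequency-file layout (one whitespace-separated table of word–count pairs per site, in a cdc-freq directory) is hypothetical:

    (* load each site's word-frequency table *)
    freqFiles = FileNames["cdc-freq/*.txt"];
    freqs = Map[Import[#, "Table"] &, freqFiles];

    (* occurrences of a term in one site's table; 0 if the term is absent *)
    hits[table_, term_] := Total[Cases[table, {term, n_} :> n]];

    (* counts for each search term in each site, charted site by site *)
    terms = {"freedom", "black", "liberty"};
    counts = Table[hits[f, t], {f, freqs}, {t, terms}];
    BarChart[counts, ChartLayout -> "Stacked", ChartLegends -> terms]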

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.
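The core of a keyword-in-context routine is only a few lines; this sketch (the file name is hypothetical) shows one way of doing it, returning each occurrence of a term with five words of context on either side:

    (* keyword in context: the term plus a window of surrounding words *)
    kwic[words_List, term_String, window_: 5] :=
      Module[{pos = Flatten[Position[words, term]]},
        Map[StringRiffle[words[[Max[1, # - window] ;;
            Min[Length[words], # + window]]]] &, pos]];

    blackgold = StringSplit[ToLowerCase[Import["cdc/blackgold.txt", "Text"]],
        Except[WordCharacter] ..];
    Column[kwic[blackgold, "ontario"]]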


A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly realize critical sources.

Another model through which we can visualize documents is through a viewer primarily focusing on term document frequency and inverse document frequency, terms discussed below. To prepare this material, we are combining our earlier data about word frequency (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e., the word "experience" appears 1,766 times). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, where we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf times the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."
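In symbols, tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing the term t. A small sketch, reusing the sites word lists built in the earlier example:

    (* tf: occurrences of a term in one document (a list of words) *)
    tf[term_, doc_] := Count[doc, term];

    (* df: the number of documents in the corpus containing the term *)
    df[term_, corpus_] := Count[corpus, d_ /; MemberQ[d, term]];

    (* tf-idf: term frequency weighted by the rarity of the term *)
    tfidf[term_, doc_, corpus_] :=
      tf[term, doc]*Log[Length[corpus]/df[term, corpus]] /;
        df[term, corpus] > 0;

    tfidf["experience", First[sites], sites]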


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule, but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance rather than processed on-site, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need rears itself, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin or jumpstart a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. This should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distance reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web: whether through blogging, micro-blogging, or publishing web pages, it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history, and the readiness to work with born-digital sources, cannot stem out of pedagogy alone, however. Institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary, and second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates/works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline/44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the users' following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.


14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients – 2011," Digging into Data webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to learn the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook, Penguin, 2010. For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2, The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online, www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes, a megabyte (MB) a million, a gigabyte (GB) a billion, a terabyte (TB) a trillion, and a petabyte (PB) a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can, of course, vary depending on compression methods chosen. Their math was based on an average of 8MB per book, if it was scanned at 600 dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82. Published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC: May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Center for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Center for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians, but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online, www.wayback.archive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online, www.web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcome.ic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collection webpage, 24 November 2000, through Internet Archive, www.web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections Homepage," 21 November 2003, through Internet Archive, www.web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important News about Canada's Digital Collections," Canada's Digital Collections webpage, 18 November 2006, through Internet Archive, www.web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see the information at "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 p.m. on 18 December 1983; "Edith" was more popular than Emma until around 1910, and now "Emma" is the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.


48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 "Important Notices," Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. They can usually be found on index or splash pages, under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated late-2007 MacBook Pro.

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.


12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden onAnthony Grafton ldquoLoneliness and Freedomrdquo Perspectives on History[online edition] (March 2011)wwwhistoriansorgperspectivesissues201111031103pre1cfmltviewed 11 January 2013gt

13 For more information see the Canadian Century Research InfrastructureProject website at wwwccriuottawacaCCRIHomehtml and the Programme de recherche en deacutemographie historique at wwwgeneologyumontrealcafrleprdhhtm

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

14 ldquoWith Criminal Intentrdquo CriminalIntentorg wwwcriminalintentorgltviewed 20 December 2012gt

15 For an overview see ldquoAward Recipients mdash 2011rdquo Digging into DataWebpage wwwdiggingintodataorgHomeAwardRecipients2011tabid185Defaultaspx ltviewed 20 December 2012gt

16 One can do no better than the completely free Programming Historianwhich uses entirely free and open-source programming tools to learn thebasics of computational history See The Programming Historian 2 2012-Present wwwprogramminghistorianorg ltviewed 11 January 2013gt

17 Michael Katz The People of Hamilton Canada West Family and Class ina Mid-Nineteenth-Century City (Cambridge MA Harvard UniversityPress 1975) Also see A Gordon Darroch and Michael D OrnsteinldquoEthnicity and Occupational Structure in Canada in 1871 The VerticalMosaic in Historical Perspectiverdquo Canadian Historical Review 61 no 3(1980) 305ndash33

18 Ian Anderson ldquoHistory and Computingrdquo Making History The ChangingFace of the Profession in Britain 2008 wwwhistoryacukmakinghistoryresourcesarticleshistory_and_computinghtml ltviewed 20 December2012gt

19 Christopher Collins ldquoText and Document Visualizationrdquo TheHumanities and Technology Camp Greater Toronto Area VisualizationBootcamp University of Ontario Institute of Technology Oshawa Ont21 October 2011

20 The quotations by Luther and Poe are commonly used to make thispoint For the best example see Clay Shirky Cognitive SurplusCreativity and Generosity in a Connected Age Google eBook Penguin2010 For Mumford see Lewis Mumford The Myth of the Machine vol2 The Pentagon of Power (New York Harcourt Brace 1970) 182 asquoted in Gleick The Information A History A Theory A Flood (NewYork Pantheon 2011) 404

21 Adam Ostrow ldquoCisco CRS-3 Download Library of Congress in JustOver One Secondrdquo Mashable (9 March 2010) accessed onlinewwwmashablecom20100309cisco-crs-3

22 A note on data size is necessary The basic unit of information is a ldquobyterdquo (8 bits) which is roughly a single character A kilobyte (KB) is athousand bytes a megabyte (MB) a million a gigabyte (GB) a billion aterabyte (TB) a trillion and a petabyte (PB) a quadrillion Beyond thisthere are exabytes (EB) zettabytes (ZB) and yottabytes (YB)

23 These figures can of course vary depending on compression methodschosen Their math was based on an average of 8MB per book if it wasscanned at 600dpi and then compressed There are 26 million books inthe Library of Congress For more see Mike Ashenfelder ldquoTransferringlsquoLibraries of Congressrsquo of Datardquo Library of Congress Digital Preservation

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

Blog 11 July 2011wwwblogslocgovdigitalpreservation201107transferring-libraries-of-congress-of-data ltviewed 11 January 2013gt

24 Gleick 232 25 ldquoAbout the Library of Congress Web Archivesrdquo Library of Congress Web

Archiving FAQs Undated wwwlocgovwebarchivingfaqhtmlfaqs_02ltviewed 11 January 2013gt

26 ldquo10000000000000000 Bytes Archivedrdquo Internet Archive Blog 26October 2012 wwwblogarchiveorg2012102610000000000000000-bytes-archivedltviewed 19 December 2012gt

27 Jean-Baptiste Michel et al 176ndash82 Published Online Ahead of Print16 December 2010 wwwsciencemagorgcontent3316014176ltviewed 29 July 2011gt See also Scott Cleland ldquoGooglersquos PiracyLiabilityrdquo Forbes 9 November 2011 wwwforbescomsitesscottcle-land20111109googles-piracy-liability ltviewed 11 January 2013gt

28 ldquoGovernment of Canada Web Archiverdquo Library and Archives Canada 17October 2007 wwwcollectionscanadagccawebarchivesindex-ehtmlltviewed 11 January 2013gt

29 For background see Bill Curry ldquoVisiting Library and Archives inOttawa Not Without an Appointmentrdquo Globe and Mail 1 May 2012wwwtheglobeandmailcomnewspoliticsottawa-notebookvisiting-library-and-archives-in-ottawa-not-without-an-appointmentarticle2418960 ltviewed 22 May 2012gt and the announcement ldquoLAC Begins Implementation of New Approach to service Deliveryrdquo Library and Archives Canada February 2012wwwcollectionscanadagccawhats-new013-560-ehtml ltviewed 22May 2012gt For professional responses see the ldquoCHArsquos comment onNew Directions for Library and Archives Canadardquo CHA Website March2012 wwwcha-shccaenNews_39items19html ltviewed 19December 2012gt and ldquoSave Library amp Archives Canadardquo CanadianAssociation of University Teachers launched AprilMay 2012 wwwsavelibraryarchivesca ltviewed 19 December 2012gt

30 Ian Milligan ldquoThe Smokescreen of lsquoModernizationrsquo at Library andArchives Canadardquo ActiveHistoryca 22 May 2012wwwactivehistoryca201205the-smokescreen-of-modernization-at-library-and-archives-canada ltviewed 19 December 2012gt

31 ldquoSave LAC May 2012 Campaign Updaterdquo SaveLibraryArchivesca May2012 wwwsavelibraryarchivescaupdate-2012-05aspx ltviewed 19December 2012gt

32 The challenges and opportunities presented by such projects is extensively discussed in Daniel J Cohen ldquoThe Future of Preserving the Pastrdquo CRM The Journal of Heritage Stewardship 2 no 2

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

(Summer 2005) 6ndash19 also available freely online at the RoyRosenzweig Center for History and New Mediawwwchnmgmueduessays-on-history-new-mediaessaysessayid=39For more see ldquoSeptember 11 Digital Archive Saving the Histories of September 11 2001rdquo Centre for History and New Media andAmerican Social History Project Center for Media and Learningwww911digitalarchiveorg ltviewed 11 January 2013gt ldquoHurricaneDigital Memory Bank Collecting and Preserving the Stories of Katrina and Ritardquo Center for History and New Media wwwhurricanearchiveorg ltviewed 11 January 2013gt and ldquoOccupyArchive Archiving the Occupy Movements from 2011rdquo Centre forHistory and New Media wwwoccupyarchiveorg ltviewed 11 January2013gt

33 Daniel Cohen and Roy Rosenzweig Digital History A Guide to Gathering Preserving and Presenting the Past on the Web(Philadelphia Temple University Press 2005) chapter on ldquoPreservingDigital Historyrdquo The chapter is available fully online atwwwchnmgmuedudigitalhistorypreserving5php ltviewed 19December 2012gt

34 For an overview see ldquoPreserving Our Digital Heritage The NationalDigital Information Infrastructure and Preservation Program 2010 Report a Collaborative Initiative of the Library of CongressrdquoDigitalpreservationgov wwwdigitalpreservationgovmultimediadocumentsNDIIPP2010Report_Postpdf ltviewed 19 December 2012gt

35 Steve Meloan ldquoNo Way to Run a Culturerdquo Wired 13 February 1998available online via the Internet Archivewwwwebarchiveorgweb20000619001705httpwwwwiredcomnewsculture012841030100html ltviewed 19 December 2012gt

36 Some of this has been discussed in greater depth on my website See theseries beginning with Ian Milligan ldquoWARC Files A Challenge forHistorians and Finding Needles in Haystacksrdquo ianmilliganca 12December 2012 wwwianmilliganca20121212warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks ltviewed 20 December2012gt

37 Unstructured data as opposed to structured data is important for histo-rians but brings additional challenges Some projects have producedstructured data novels marked up using XML encoding for example soone can immediately see place names dates character names and soforth This makes it much easier for a computer to read and understand

38 See William Noblett ldquoDigitization A Cautionary Talerdquo New Review ofAcademic Librarianship 17 (2011) 3 Tobias Blanke Michael Bryantand Mark Hedges ldquoOcropodium Open Source OCR for Small-ScaleHistorical Archivesrdquo Journal of Information Science 38 no 1 (2012) 83

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

and Donald S Macqueen ldquoDeveloping Methods for Very-Large-ScaleSearches in Proquest Historical Newspapers Collections and InfotracThe Times Digital Archive The Case of Two Million versus TwoMillionsrdquo Journal of English Linguistics 32 no 2 (June 2004) 127

39 Elizabeth Krug ldquoCanadarsquos Digital Collections Sharing the CanadianIdentity on the Internetrdquo The Archivist [Library and Archives Canadapublication] April 2000 wwwcollectionscanadagccapublicationsarchivistmagazine015002-2170-ehtml ltviewed 11 January 2013gt

40 ldquoWayBackMachine Beta wwwcollectionsicgccardquo Internet ArchiveAvailable onlinewwwwaybackarchiveorgwebhttpcollectionsicgcca ltviewed 11January 2013gt

41 ldquoStreet Youth Go Onlinerdquo News Release from Industry Canada InternetArchive 15 November 1996 Available onlinewwwwebarchiveorgweb20001011131150httpinfoicgccacmbWelcomeicnsfffc979db07de58e6852564e400603639514a8d746a34803385256612004d90b3OpenDocument ltviewed 11 January 2013gt

42 Darlene Fichter ldquoProjects for Libraries and Archivesrdquo Canadarsquos DigitalCollection Webpage 24 November 2000 through Internet Archivewwwwebarchiveorgweb200012091312httpcollectionsicgccaEcdc_projhtm ltviewed 18 June 2012gt

43 ldquoCanadarsquos Digital Collections homepagerdquo 21 November 2003 through Internet Archive wwwwebarchiveorgweb20031121213939httpcollectionsicgccaEindexphpvo=15 ltviewed 18 June 2012gt

44 ldquoImportant news about Canadarsquos Digital Collectionsrdquo Canadarsquos DigitalCollections Webpage 18 November 2006gt through Internet Archivewwwwebarchiveorgweb20061118125255httpcollectionsicgccaEviewhtml ltviewed 18 June 2012gt

45 For more information see the information at ldquoWolfram Mathematica9rdquo Wolfram Research wwwwolframcommathematica ltviewed 11January 2013gt An example of work in Mathematica can be seen inWilliam J Turkelrsquos demonstration of ldquoTerm Weighting with TF-IDFrdquodrawing on Old Bailey proceedings It is available at wwwdemonstrationswolframcomTermWeightingWithTFIDFltviewed 11 January 2013gt

46 It was 8 degrees Celsius at 500 pm on 18 December 1983 ldquoEdithrdquowas more popular than Emma until around 1910 and now ldquoEmmardquo isthe third most popular name in the United States for new births in2010 the Canadian unemployment rate in March 1985 was 106 per-cent 25th highest in the world and the Canadian GDP in December1995 was $6039 billion per year 10th largest in the world

47 It was formerly known as ldquoVoyeur Toolsrdquo before a 2011 name change Itis available online at wwwvoyant-toolsorg

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

48 For more information see ldquoSoftware Environment for the Advancementof Scholarly Research (SEASR)rdquo wwwseasrorg ltviewed 18 June2012gt

49 The MALLET toolkit can also be run independently of SEASR For more on topic modeling specifically see ldquoTopic Modelingrdquo MALLET Machine Learning for Language Toolkitwwwmalletcsumassedutopicsphp ltviewed 18 June 2012gt

50 For a good overview see ldquoAnalytics Dunningrsquos Log-Likelihood StatisticrdquoMONK Metadata Offer New Knowledge wwwgautamlisillinoisedumonkmiddlewarepublicanalyticsdunningshtml ltviewed 18 June2012gt

51 ldquoImportant Noticesrdquo Library and Archives Website last updated 30 July2010 wwwcollectionscanadagccanoticesindex-ehtml ltviewed 18June 2012gt Historians will have to increasingly learn to navigate theTerms of Services (TOS) of born-digital source repositories They can beusually found on index or splash pages under headings as varied asldquoTerms of Servicerdquo ldquoPermissionsrdquo ldquoImportant Noticesrdquo or others

52 The command was wget mdashm mdash-cookies=on mdash-keep-session-cookies mdash-load-cookies=cookietxt mdash-referrer=httpepelac-bacgcca100205301iccdccensusdefaulthtm mdashrandom-wait mdashwait=1 mdashlimit-rate=30K httpepelac-bacgcca100205301iccdcEAlphabetasp This loadeda previous cookie load as there is a ldquosplash redirectrdquo page warning aboutdysfunctional archived content It also imposed a random wait andbandwidth cap to be a good netizen

53 For a discussion of this see Christopher D Manning PrabhakarRaghavan Hinrich Scuumltze An Introduction to Information Retrieval(Cambridge UK Cambridge University Press 2009) 27ndash8 Availableonline at wwwnlpstanfordeduIR-book

54 David M Blei ldquoIntroduction to Probabilistic Topic Modelsrdquowwwcsprincetonedu~bleipapersBlei2011pdf ltviewed 18 June2012gt

55 The research was actually carried out on a rather dated late 2007MacBook Pro

56 See John Bonnett ldquoHigh-Performance Computing An Agenda for theSocial Sciences and the Humanities in Canadardquo Digital Studies 1 no 2(2009) wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt and John Bonnett Geoffrey Rockwell and KyleKuchmey ldquoHigh Performance Computing in the arts and HumanitiesrdquoSHARCNET Shared Hierarchical Academic Research Computing Network2006 wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt

57 ldquoSHARCNET History of SHARCNETrdquo SHARCNET 2009

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

wwwsharcnetcamyabouthistory ltviewed 18 December 2012gt 58 ldquoPriority Areasrdquo SSHRC last modified 4 November 2011

wwwsshrc-crshgccafunding-financementprograms-programmespriority_areas-domaines_prioritairesindex-engaspx ltviewed 18 June2012gt

59 Eloquently discussed in Adam Crymble ldquoIs Digital Humanities a Fieldof Researchrdquo Thoughts on Digital amp Public History 11 August 2011wwwadamcrymbleblogspotcom201108is-digital-humanities-field-of-researchhtml ltviewed 18 June 2012gt

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 29: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

relevant information. This can be useful, but we need to move back even further to achieve a distant reading of this born-digital collection.

If one zooms back out to this very distant level, we can extract n-gram data from the entire collection. We have two approaches: first, aggregate appearances of a given term (if "aboriginal" appears 457 times, then n=457); or second, relative appearances of a given term (if "aboriginal" appears in 45 of 384 sites, then n=45/384, or 0.1171875). In my other work on music lyrics, the latter option enables you to control for the 'chorus' effect, as the former can be drowned out by one song with repetitive words. Given the nature of this collection, I elected for the former approach. What can we learn from trigrams, for example? After an initial set of trigrams pertaining to the Louisbourg digitization project, which has boilerplate language on each page, as well as common "stop-word" equivalents of phrases (especially pronounced in French), we come across the first set of meaningful phrases, excerpted below:

"experience" "in" "the"  919
"transferred" "to" "the"  918
"funded" "by" "the"  918
"of" "the" "provincial"  915
"this" "web" "site"  915
"409" "registry" "2"  914
"are" "no" "longer"  914
"gouvernement" "du" "canada"  910
"use" "this" "link"  887
"page" "or" "you"  887
"following" "page" "or"  887
"automatically" "transferred" "to"  887
"be" "automatically" "transferred"  887
"will" "be" "automatically"  887
"industry" "you" "will"  887
"multimedia" "industry" "you"  887
"practical" "experience" "in"  887
"gained" "practical" "experience"  887
"they" "gained" "practical"  887
"while" "they" "gained"  887
"canadians" "while" "they"  887
"young" "canadians" "while"  887
"showcase" "of" "work"  887
"excellent" "showcase" "of"  887
…

Here we see an array of common language: funding information, link information, project information, spread across over 887 websites. Reading down, we see information about governments, job experience, project information, initiatives, re-structuring, and so forth. At a glance, we can see what language is common to all websites, as opposed to unique information. Moving down, we find the more common but less universal: "the history of," for example, "heritage community foundation," libraries, and so forth. By the end of the list, we can move into the unique or very rare phrases that appear only once, twice, or three times.
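To make the preceding discussion concrete, the trigram tallies can be reproduced in a few lines of Mathematica, the environment used throughout this case study. The sketch below is illustrative rather than the article's own code: the folder name cdc-collection, and the assumption that each harvested site sits in its own plain-text file, are mine.

    (* Illustrative sketch: tally trigrams across a folder of plain-text
       files; "cdc-collection" is an assumed layout, not the author's. *)
    files = FileNames["*.txt", "cdc-collection"];
    words[f_] := StringSplit[ToLowerCase[Import[f, "Text"]],
       Except[WordCharacter] ..];
    trigrams[f_] := Partition[words[f], 3, 1];
    (* First approach: aggregate appearances across the whole collection. *)
    aggregate = Reverse[SortBy[Tally[Flatten[trigrams /@ files, 1]], Last]];
    (* Second approach: the number of sites in which each trigram appears. *)
    perSite = Reverse[SortBy[
        Tally[Flatten[DeleteDuplicates[trigrams[#]] & /@ files, 1]], Last]];
    Take[aggregate, 25] (* the most frequent trigrams, as excerpted above *)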

Moving beyond word and phrase frequency, there are a few more sophisticated off-the-shelf options that can help us make sense of large arrays of unstructured data. Topic modeling is among the most developed, as Princeton's David M. Blei (a Computer Science professor) describes in a general introduction to the topic:

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of [traditional] browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time. … Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.54

Using the aforementioned SEASR Meandre workbench (or via MALLET itself and a command-line interface), we can quickly begin to run our born-digital information through the topic-modeling algorithm. The results are notable, and come in two formats: a word cloud visualization, as well as an XML file containing a line-by-line breakdown of each topic. Taking the aforementioned file 267 ("past and present"), we can begin to get a sense of the topics this large website discusses.


At a glance, we can learn about some of the main themes: the structure of Alberta (Alberta, towns, bridges, lakes), the geographic shape of the area (life, land, lands, areas, links, grass and, interestingly, language), the stories (story, knight, archives, but also, interestingly, knight and wife), and so forth. The word cloud gives us a sense of the most relevant terms that define each topic, but we also have a 3,200-line XML file, broken into topics, which lets us further refine our inquiry.
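For readers who prefer the command-line route to SEASR, the standard MALLET workflow is two commands: import a directory of text files into MALLET's binary format, then train the model. The invocation below is a sketch rather than the configuration used here; the directory, output file names, and the choice of 20 topics are illustrative assumptions.

    # Illustrative MALLET invocation; paths and parameter values are assumed.
    bin/mallet import-dir --input cdc-collection/ --output cdc.mallet \
        --keep-sequence --remove-stopwords
    bin/mallet train-topics --input cdc.mallet --num-topics 20 \
        --output-topic-keys cdc-keys.txt --output-doc-topics cdc-topics.txt

The first output file lists the top words defining each topic; the second gives the per-document topic proportions.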

A last set of examples will help us move from the aggregate level of hundreds of thousands of lines of text to the individual line, in essence creating our own search engine to navigate this collection.

There are a few good reasons why we should not use Google to do all of this in our own work. First, we do not know precisely how Google works: it is a proprietary black box. While the original algorithm is available, the actual methods that go into determining the ranking of search terms are a trade secret. Second, Google is optimized for a number of uses, mainly to deliver the most relevant information to the user. Historians and archivists are often looking for specific information within old websites, and it behooves us to examine them through the lens of more specialized tools. With these hesitations in mind, I have sought to develop my own rudimentary search engine for this born-digital collection.


A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for the occurrence of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. For example, in the graph above we see a visualization of a given term, in this case our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
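As a sketch of the first prong, and assuming (my assumption, for illustration) that each site's word-frequency file holds tab-separated word-count pairs in a cdc-frequencies folder, the search and chart might look like this in Mathematica:

    (* Illustrative sketch: scan per-site frequency files for three terms
       and chart where the hits fall across the collection. *)
    terms = {"freedom", "black", "liberty"};
    freqFiles = FileNames["*.txt", "cdc-frequencies"];
    countIn[f_, t_] := Total[Cases[Import[f, "TSV"], {t, n_} :> n]];
    hits = Table[countIn[f, t], {f, freqFiles}, {t, terms}];
    BarChart[hits, ChartLegends -> terms,
     ChartLabels -> {FileBaseName /@ freqFiles, None}]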

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.


A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontario, where many oil fields are. If we move forward, we can see that it pertains to the nineteenth century, a particular region, as well as subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly identify critical sources.
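The article's own routine is not reproduced here, but a minimal keyword-in-context function can be sketched as follows; the file name and the 40-character context window are illustrative assumptions.

    (* Minimal keyword-in-context sketch, not the author's own routine:
       show each occurrence of a term padded with surrounding context. *)
    kwic[file_String, term_String, pad_Integer: 40] :=
      Module[{text = Import[file, "Text"]},
       StringTake[text, {Max[#[[1]] - pad, 1],
            Min[#[[2]] + pad, StringLength[text]]}] & /@
        StringPosition[text, term, IgnoreCase -> True]];
    Column[kwic["blackgold.txt", "ontario"]]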

Another model through which we can visualize documents is a viewer primarily focusing on term frequency and inverse document frequency, terms discussed below. To prepare this material, we combine our earlier data about word frequency (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e., the word "experience" appears 1,766 times). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, through which we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf multiplied by the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.
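As a sketch of the computation behind such a view (reusing the illustrative words helper and cdc-collection folder from the trigram example above):

    (* tf-idf as defined above: term frequency times the log of total
       documents over the number of documents containing the term. *)
    docs = words /@ FileNames["*.txt", "cdc-collection"];
    nDocs = Length[docs];
    df[t_] := Count[docs, d_ /; MemberQ[d, t]];
    tfidf[t_, doc_] := Count[doc, t] * N[Log[nDocs/df[t]]];
    tfidf["experience", First[docs]]

A term that appears in every site scores zero, which is why ubiquitous boilerplate of the kind identified earlier washes out under this weighting.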

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program, rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule, but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance, rather than processed on-site, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, should the need arise, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is meant to begin, or jumpstart, a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. This should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information; this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of various ways to openly disseminate material on the web, whether through blogging, micro-blogging, publishing webpages, and so forth; it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history and the readiness to work with born-digital sources cannot stem from pedagogy alone, however; institutional support is necessary. While based on anecdotal evidence, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary and, second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.


14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients — 2011," Digging into Data webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to learn the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook, Penguin, 2010. For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in James Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online: www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can, of course, vary depending on compression methods chosen. Their math was based on an average of 8MB per book, if it was scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82; published online ahead of print 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Center for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Center for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive: web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in ProQuest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online: wayback.archive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," news release from Industry Canada, Internet Archive, 15 November 1996, available online: web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collection webpage, 24 November 2000, through Internet Archive: web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections homepage," 21 November 2003, through Internet Archive: web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important news about Canada's Digital Collections," Canada's Digital Collections webpage, 18 November 2006, through Internet Archive: web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 p.m. on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and now "Emma" is the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.


48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 "Important Notices," Library and Archives website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. They can usually be found on index or splash pages under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie file, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated late 2007 MacBook Pro.

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 30: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

websites Reading down we see information about governments jobexperience project information initiatives re-structuring and soforth At a glance we can see what language is common to all web-sites as opposed to unique information Moving down we find themore common but less universal ldquothe history ofrdquo for example ldquoher-itage community foundationrdquo libraries and so forth By the end ofthe list we can move into the unique or very rare phrases that appearonly once twice or three times

Moving beyond word and phrase frequency there are a fewmore sophisticated off-the-shelf options that can help us make senseof large arrays of unstructured data Topic modeling is among themost developed as Princetonrsquos David M Blei (a Computer Scienceprofessor) describes in a general introduction to the topic

While more and more texts are available online we simplydo not have the human power to read and study them toprovide the kind of [traditional] browsing experiencedescribed above To this end machine learning researchershave developed probabilistic topic modeling a suite of algo-rithms that aim to discover and annotate large archives ofdocuments with thematic information Topic modelingalgorithms are statistical methods that analyze the words ofthe original texts to discover the themes that run throughthem how those themes are connected to each other andhow they change over time hellip Topic modeling enables usto organize and summarize electronic archives at a scalethat would be impossible by human annotation55

Using the aforementioned SEASR Meandre workbench (or viaMALLET itself and a command-line interface) we can quickly beginto run our born-digital information through the topic-modelingalgorithm The results are notable and come in two formats a wordcloud visualization as well as an XML file containing a line-by-linebreakdown of each topic Taking the aforementioned file 267 (ldquopastand presentrdquo) we can begin to get a sense of the topics this large web-site discusses

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

At a glance we can learn about some of the main themes the struc-ture of Alberta (Alberta towns bridges lakes) the geographic shapeof the area (life land lands areas links grass and interestingly lan-guage) the stories (story knight archives but also interestinglyknight and wife) and so forth The word cloud gives us a sense ofthe most relevant terms that define each topic but we also have a3200-line XML file broken into topics which lets us further refineour inquiry

A last set of examples will help us move from the aggregate levelof hundreds of thousands of lines of text to the individual line inessence creating our own search engine to navigate this collection

There are a few good reasons why we should not use Google todo all of this in our own work First we do not know how Googleworks precisely It is a proprietary black box While the original algo-rithm is available the actual methods that go into determining theranking of search terms is a trade secret Second Google is optimizedfor a number of uses mainly to deliver the most relevant informa-tion to the user Historians and archivists are looking often forspecific information within old websites and it behooves us to exam-ine them through the lens of more specialized tools With thesehesitations in mind I have sought to develop my own rudimentarysearch engine for this born-digital collection

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

A two-pronged approach can take us to relevant informationquickly In the following example I am carrying out a project onblack history We quickly search all of the CDC word frequency filesfor the occurrence of the word ldquofreedomrdquo ldquoblackrdquo and ldquolibertyrdquo(these terms of course require some background knowledge thesearch is shaped by the user) obtain a list and then map it onto achart in order to quickly visualize their locations in the collection Forexample in the graph above we see a visualization of a given term inthis case our three search terms As one moves the mouse over eachbar the frequency data for a given term appears We can also zoomand enhance this graph to use it as we wish A demonstration of thiscan be found online at wwwianmilliganca20130118cdc-tool-1

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.
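
The routine itself is written in Mathematica, but the underlying idea fits in a few lines of any language; the Python sketch below, with a hypothetical blackgold.txt as input, shows the general shape of a keyword-in-context function.

    import re

    def kwic(text, term, window=5):
        """Return snippets of `window` words on either side of each hit."""
        tokens = re.findall(r"\w+", text.lower())
        return [" ".join(tokens[max(0, i - window): i + window + 1])
                for i, tok in enumerate(tokens) if tok == term.lower()]

    # Hypothetical usage against the file discussed above:
    # for snippet in kwic(open("blackgold.txt").read(), "ontario"):
    #     print(snippet)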

[Figure: a snapshot of the dynamic finding aid tool]

We see that this website focuses mostly on southwestern Ontario, where many oil fields are located. If we move forward, we can see that it pertains to the nineteenth century and a particular region, as well as to subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly identify critical sources.

Another model through which we can visualize documents is a viewer primarily focusing on term document frequency and inverse document frequency, terms discussed below. To prepare this material, we combine our earlier data about word frequency (that the word "experience" appears 919 times in a given website, perhaps) with aggregate information about frequency in the entire corpus (i.e., the word "experience" appears 1,766 times overall). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, through which we can learn which words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf multiplied by the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."
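
Expressed as code, the weighting is a one-liner; this sketch (with invented counts, for illustration only) shows how sharply tf-idf separates a locally distinctive term from a ubiquitous one.

    import math

    def tf_idf(tf, n_docs, docs_with_term):
        # tf times the logarithm of (total documents / documents with term)
        return tf * math.log(n_docs / docs_with_term)

    # A term appearing 31 times in one site but in only 2 of 100 sites
    # scores high; the same count spread across every site scores zero.
    print(tf_idf(31, 100, 2))    # ~121.3
    print(tf_idf(31, 100, 100))  # 0.0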

At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program rather than quickly navigating between them. This stems from the massive number of files and the amount of data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance rather than processed on-site, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need arises, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin, or jumpstart, a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. These should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and of attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before. How have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital repositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information; this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of the various ways to openly disseminate material on the web: whether through blogging, micro-blogging, publishing webpages, and so forth, it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history and the readiness to work with born-digital sources cannot stem from pedagogy alone, however. Institutional support is necessary. While the evidence is anecdotal, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessment committees, leading to fears that their projects might be lost between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for instance, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary and, second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.

14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients – 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to learn the basics of computational history. See The Programming Historian 2, 2012–present, programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp, Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook (Penguin, 2010). For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online, mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can, of course, vary depending on the compression methods chosen. Their math was based on an average of 8MB per book, if it was scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82, published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC: May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Center for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Center for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: Temple University Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in ProQuest Historical Newspapers Collections and InfoTrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online, wayback.archive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online, web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collection Webpage, 24 November 2000, through Internet Archive, web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections Homepage," 21 November 2003, through Internet Archive, web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important News about Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive, web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 p.m. on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and now "Emma" is the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.

48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 "Important Notices," Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. They can usually be found on index or splash pages under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8, available online at nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated late-2007 MacBook Pro.

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 31: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

At a glance we can learn about some of the main themes the struc-ture of Alberta (Alberta towns bridges lakes) the geographic shapeof the area (life land lands areas links grass and interestingly lan-guage) the stories (story knight archives but also interestinglyknight and wife) and so forth The word cloud gives us a sense ofthe most relevant terms that define each topic but we also have a3200-line XML file broken into topics which lets us further refineour inquiry

A last set of examples will help us move from the aggregate levelof hundreds of thousands of lines of text to the individual line inessence creating our own search engine to navigate this collection

There are a few good reasons why we should not use Google todo all of this in our own work First we do not know how Googleworks precisely It is a proprietary black box While the original algo-rithm is available the actual methods that go into determining theranking of search terms is a trade secret Second Google is optimizedfor a number of uses mainly to deliver the most relevant informa-tion to the user Historians and archivists are looking often forspecific information within old websites and it behooves us to exam-ine them through the lens of more specialized tools With thesehesitations in mind I have sought to develop my own rudimentarysearch engine for this born-digital collection

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

A two-pronged approach can take us to relevant informationquickly In the following example I am carrying out a project onblack history We quickly search all of the CDC word frequency filesfor the occurrence of the word ldquofreedomrdquo ldquoblackrdquo and ldquolibertyrdquo(these terms of course require some background knowledge thesearch is shaped by the user) obtain a list and then map it onto achart in order to quickly visualize their locations in the collection Forexample in the graph above we see a visualization of a given term inthis case our three search terms As one moves the mouse over eachbar the frequency data for a given term appears We can also zoomand enhance this graph to use it as we wish A demonstration of thiscan be found online at wwwianmilliganca20130118cdc-tool-1

We can then move down into the data In the example aboveldquo047rdquo labeled ldquoblackgoldtxtrdquo contains 31 occurrences of the wordldquoontariordquo Moving to my Mathematica keyword-in-context routinewe can see where these occurrences of the word appear

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

A snapshot of the dynamic finding aid tool

We see that this website focuses mostly on southwestern Ontariowhere many oil fields are If we move forward we can see that it per-tains to the nineteenth century a particular region as well assubsequent memorialization activities (mention of a museum forexample) All this can be done from within the textual client asopposed to going out into individual websites Broad contoursemerge from this distant reading but they also facilitate deeper read-ing by allowing us to promptly realize critical sources

Another model through which we can visualize documents isthrough a viewer primarily focusing on term document frequencyand inverse frequency terms discussed below To prepare this mater-ial we are combining our earlier data about word frequency (that theword ldquoexperiencerdquo appears 919 times perhaps) with aggregate infor-mation about frequency in the entire corpus (ie the wordldquoexperiencerdquo appears 1766 times) The number of times that a termappears in a given website can be defined as tf (term frequency)Another opportunity is inverse document frequency where we canlearn what words are most distinct and unique in a given documentThe formula for this is the oft-cited tf-idf a staple of informationretrieval theory or tf times the logarithm of the number of docu-ments divided into the number of documents where the given termappears At a glance then we can see some of the unique words con-tained within a website if we select ldquoinverse document frequencyrdquo

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

At a very quick glance one can see the major words that make this aunique document This technique can be used on any textual docu-ment

Off-the-shelf tools can facilitate a deeper yet distant reading ofindividual websites as well especially the suite of research tools onwebsites such as Many Eyes and Voyant Using these tools howeverrequires that one puts individual data into each separate programrather than quickly navigating between them This stems from themassive amount of files and data involved Many Eyes limits files to5MB and Voyant has no hard and fast rule but can struggle withlarger files

This analysis was carried out on my personal computer55 Largerdata sets may necessitate a corresponding increase in computationalpower bringing social sciences and humanities researchers into con-tact with High Performance Computing (HPC) centres or consortiaThese extremely powerful computers provide researchers with com-puting power orders of magnitude above that of personal computersallowing results in hours instead of the weeks or months it might

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

take to process on a smaller system56 This has particular applicationin the basic processing of a very large web archive There are limita-tions however jobs must be submitted into queues in advancerather than on-site processing and perhaps most importantly HPCcentres prefer to run highly optimized code to minimize usage oftheir very expensive and scarcely allocated equipment As this level ofcoding expertise is often beyond humanities researchers researchteams must be formed

Humanities researchers would do well to familiarize themselveswith local HPC options many of which are more than happy towork on new and engaging questions These are accessible optionsin Ontario for example all but one university belongs to a HPCconsortium57 Yet as personal computing continues to dramaticallyincrease in speed and as storage prices continue to plummet textualanalysis is increasingly feasible on high-quality office systems An ini-tial exposure can generate questions and problems that if the needrears itself can subsequently be taken to the supercomputer

This case study is intended to provide one model for distantreading and the generation of finding aids in the interest of showingsome of the potential for working with born-digital sources It isintended to begin or jumpstart a conversation around the resourcesand skills necessary for the next generation A seemingly insur-mountable amount of information contained across over 73000files (as in the CDC collection) can be quickly navigated and iso-lated for researcher convenience

Conclusion Digital Literacy for the Next Generation ofHistorians

As a discipline I would like to suggest one path forward prepare thenext generation of historians at least some of them to deal with thedeluge of born-digital sources Fundamentally the challenge of born-digital sources is a pedagogical one For existing historians this is anissue of continual education for future historians an issue of profes-sional training (the annual Digital Humanities Summer Instituteheld at the University of Victoria each June is worth considering asan introduction to many digital concepts) Some off-the-shelf tools

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

can serve as gateways to deeper engagement notably the VoyantTools suite (wwwvoyant-toolsorg) and perhaps subsequently thelessons contained in the Programming Historian 2 That said digitalliteracy courses should become part of the essential pantheon of his-torical curricula This should not be mandatory but instead taughtalongside and within traditional historiography and research meth-ods courses In an era of diminishing resources and attention paid toknowledge translation and relevancy digital literacy courses are alsowell positioned to provide graduates with concrete portfolios andactionable skills If we want to be prepared to use digital method-ologies to study the post-1990 world we will need ready historians

Such a course should have six basic priorities First we must pro-vide historiographical context for new students on whatcomputational methods have been tried before How have quantita-tive and qualitative methods differed yet ultimately complementedeach other Second students in this course would need to develop anunderstanding of the digital basics cloud computing effectivebackup and versioning digital security and an understanding of howthe Internet is organized Third for effective cross-departmental andsustained engagement with digital history students must to learn howto digitally navigate conventional sources notable skill sets wouldinclude a familiarity with citation management software databasesand the possibility of using optical character recognition on a varietyof born-paper sources subsequently digitized through scanners orcameras This must also include a grasp of the scope and significanceof large digital depositories including Google Books the InternetArchive and others mentioned earlier Fourth students must learnhow to process large amounts of digital information this includes off-the-shelf visualization tools material discussed here and some of thephilosophical and methodological implications of a distance readingFifth students can learn basic programming through the ProgrammingHistorian Finally and perhaps most importantly students need anunderstanding of various ways to openly disseminate material on the web whether through blogging micro-blogging publishing webpages and so forth it is important to enable the great work that we do in the university to make its mark on the world Throughthis students will also come away from history departments with

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

comprehensive and marketable digital portfolios Such a coursewould focus more on homework than reading with special emphasisplaced on deliverable content such as websites interactive modelsand broadly implementable databases

Digital history and the readiness to work with born-digitalsources cannot stem out of pedagogy alone however Institutionalsupport is necessary While based on anecdotal evidence it appearsthat some university administrators have begun to embrace the shifttowards the digital humanities At the federal level SSHRC has iden-tified the Digital Economy as one of its priority areas58 Yet when itcomes to evaluation doctoral and post-doctoral researchers do nothave a ldquodigital humanitiesrdquo option open to them in terms of assessingcommittees leading to fears that their projects might be lost inbetween disciplinary areas of focus59 In light of the impending crisisof born-digital sources historians should encourage these initiatives

These sorts of digital projects do not always lend themselveswell to traditional publication venues The examples above from theCDC case study for example are best viewed as dynamically mal-leable models this will be one challenge as historians grapple withnew publication models This article argues first why digitalmethodologies are necessary and second how we can realize aspectsof this call The Internet is 20 years old and before historians realizeit we will be faced with the crisis of born-digital information over-load If history is to continue as the leading discipline inunderstanding the social and cultural past decisive movementtowards the digital humanities is necessary When the sea comes inwe must be ready to swim

IAN MILLIGAN is an assistant professor of Canadian history at theUniversity of Waterloo He is a founding co-editor of ActiveHistorycaan editor-at-large of the Programming Historian 2 and has written sev-eral articles in the fields of Canadian youth digital and labour history

IAN MILLIGAN est professeur adjoint en histoire canadienne agrave laUniversity of Waterloo Il est le fondateur et codirecteur du site

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Internet ActiveHistoryca ainsi qursquoun eacutediteur du site ProgrammingHistorian 2 Il a eacutecrit plusieurs articles au sujet de la jeunesse cana-dienne de la numeacuterisation et de lrsquohistoire des travailleurs

Endnotes

1 The question of when the past becomes a ldquohistoryrdquo to be professionallystudied is a tricky one Yet it is worth noting that the still-definitivework on the Baby Boom Generation appeared in 1996 with DougOwram Born at the Right Time A History of the Baby Boom Generationin Canada (Toronto University of Toronto Press 1996) This actuallyfollowed the earlier scholarly monograph Cyril Levitt Children ofPrivilege Student Revolt in the Sixties (Toronto University of TorontoPress 1984) and a host of other laudable works As academicdebatesworks on the 1960s now suggest the past becomes historysooner than we often expect

2 A critical point advanced in the field of literary criticism and theory byStephen Ramsay Reading Machines Towards an Algorithmic Criticism(Urbana University of Illinois Press 2011)

3 See ldquoIdleNoMore Timelinerdquo ongoingwwwidlenomoremakookcatimeline44 ltviewed 18 January 2013gt

4 These issues have been discussed elsewhere albeit in either a moreabridged form or with a focus on digitized historical sources A notableexception is Daniel J Cohen and Roy Rosenzweig Digital History AGuide to Gathering Preserving and Presenting the Past on the Web(Philadelphia University of Pennsylvania Press 2005) specifically theirchapter on ldquoCollecting History onlinerdquo Their focus is on the preservation-ist and archival challenges however Important notes are also found inDan Cohen ldquoHistory and the Second Decade of the Webrdquo RethinkingHistory 8 no 2 (June 2004) 293ndash301 also available online atwwwchnmgmueduessays-on-history-new-mediaessaysessayid=34

5 Figure taken from Jose Martinez ldquoTwitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Dayrdquo ComplexTech wwwcomplexcomtech201210twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day ltviewed 6 January 2013gtFurther more while a tweet itself is only 140 characters every tweetcomes with over 40 lines of metadata date information author biogra-phy and location information when the account was created the usersrsquofollowing and follower numbers country data and so forth This is aremarkable treasure trove of information for future historians but also apotential source of information overload See Sarah Perez ldquoThis is What

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

a Tweet Looks Likerdquo ReadWriteWeb (19 April 2010) wwwreadwritewebcomarchivesthis_is_what_a_tweet_looks_likephpltviewed 6 January 2013gt

6 This is further discussed in Ian Foster ldquoHow Computation ChangesResearchrdquo in Switching Codes Thinking Through Digital Technology inthe Humanities and the Arts eds Thomas Bartscherer and RoderickCoover (Chicago University of Chicago Press 2011) esp 22ndash4

7 Franco Moretti Graphs Maps Trees Abstract Models for Literary History(New York Verso 2005) 3ndash4

8 While not of prime focus here the longue dureacutee deserves definition Itstands opposite the history of events of instances covering instead avery long time span in an interdisciplinary social science framework In along essay Braudel noted that one can then see both structural crises ofeconomies and structures that constrain human society and develop-ment For more see Fernand Braudel ldquoHistory and the Social SciencesThe Longue Dureacuteerdquo in Fernand Braudel On History trans SarahMatthews (Chicago University of Chicago Press 1980) 27 The essaywas originally printed in the Annales ESC no 4 (OctoberndashDecember1958)

9 A great overview can be found at ldquoText Analysisrdquo Tooling Up for DigitalHumanities 2011 wwwtoolingupstanfordedupage_id=981 ltviewed19 December 2012gt

10 Jean-Baptiste Michel Yuan Kui Shen Aviva Presser Aiden Adrian VeresMatthew K Gray The Google Books Team Joseph P Pickett DaleHoiberg Dan Clancy Peter Norvig Jon Orwant Steven Pinker MartinA Nowak Erez Lieberman Aiden ldquoQuantitative Analysis of CultureUsing Millions of Digitized Booksrdquo Science 331 (14 January 2011)176ndash82 The tool is publicly accessible at ldquoGoogle Books NgramViewerrdquo Google Books 2010 wwwbooksgooglecomngrams ltviewed11 January 2013gt

11 Anthony Grafton ldquoLoneliness and Freedomrdquo Perspectives on History[online edition] (March 2011)wwwhistoriansorgperspectivesissues201111031103pre1cfmltviewed 11 January 2013gt

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden onAnthony Grafton ldquoLoneliness and Freedomrdquo Perspectives on History[online edition] (March 2011)wwwhistoriansorgperspectivesissues201111031103pre1cfmltviewed 11 January 2013gt

13 For more information see the Canadian Century Research InfrastructureProject website at wwwccriuottawacaCCRIHomehtml and the Programme de recherche en deacutemographie historique at wwwgeneologyumontrealcafrleprdhhtm

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

14 ldquoWith Criminal Intentrdquo CriminalIntentorg wwwcriminalintentorgltviewed 20 December 2012gt

15 For an overview see ldquoAward Recipients mdash 2011rdquo Digging into DataWebpage wwwdiggingintodataorgHomeAwardRecipients2011tabid185Defaultaspx ltviewed 20 December 2012gt

16 One can do no better than the completely free Programming Historianwhich uses entirely free and open-source programming tools to learn thebasics of computational history See The Programming Historian 2 2012-Present wwwprogramminghistorianorg ltviewed 11 January 2013gt

17 Michael Katz The People of Hamilton Canada West Family and Class ina Mid-Nineteenth-Century City (Cambridge MA Harvard UniversityPress 1975) Also see A Gordon Darroch and Michael D OrnsteinldquoEthnicity and Occupational Structure in Canada in 1871 The VerticalMosaic in Historical Perspectiverdquo Canadian Historical Review 61 no 3(1980) 305ndash33

18 Ian Anderson ldquoHistory and Computingrdquo Making History The ChangingFace of the Profession in Britain 2008 wwwhistoryacukmakinghistoryresourcesarticleshistory_and_computinghtml ltviewed 20 December2012gt

19 Christopher Collins ldquoText and Document Visualizationrdquo TheHumanities and Technology Camp Greater Toronto Area VisualizationBootcamp University of Ontario Institute of Technology Oshawa Ont21 October 2011

20 The quotations by Luther and Poe are commonly used to make thispoint For the best example see Clay Shirky Cognitive SurplusCreativity and Generosity in a Connected Age Google eBook Penguin2010 For Mumford see Lewis Mumford The Myth of the Machine vol2 The Pentagon of Power (New York Harcourt Brace 1970) 182 asquoted in Gleick The Information A History A Theory A Flood (NewYork Pantheon 2011) 404

21 Adam Ostrow ldquoCisco CRS-3 Download Library of Congress in JustOver One Secondrdquo Mashable (9 March 2010) accessed onlinewwwmashablecom20100309cisco-crs-3

22 A note on data size is necessary The basic unit of information is a ldquobyterdquo (8 bits) which is roughly a single character A kilobyte (KB) is athousand bytes a megabyte (MB) a million a gigabyte (GB) a billion aterabyte (TB) a trillion and a petabyte (PB) a quadrillion Beyond thisthere are exabytes (EB) zettabytes (ZB) and yottabytes (YB)

23 These figures can of course vary depending on compression methodschosen Their math was based on an average of 8MB per book if it wasscanned at 600dpi and then compressed There are 26 million books inthe Library of Congress For more see Mike Ashenfelder ldquoTransferringlsquoLibraries of Congressrsquo of Datardquo Library of Congress Digital Preservation

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

Blog 11 July 2011wwwblogslocgovdigitalpreservation201107transferring-libraries-of-congress-of-data ltviewed 11 January 2013gt

24 Gleick 232 25 ldquoAbout the Library of Congress Web Archivesrdquo Library of Congress Web

Archiving FAQs Undated wwwlocgovwebarchivingfaqhtmlfaqs_02ltviewed 11 January 2013gt

26 ldquo10000000000000000 Bytes Archivedrdquo Internet Archive Blog 26October 2012 wwwblogarchiveorg2012102610000000000000000-bytes-archivedltviewed 19 December 2012gt

27 Jean-Baptiste Michel et al 176ndash82 Published Online Ahead of Print16 December 2010 wwwsciencemagorgcontent3316014176ltviewed 29 July 2011gt See also Scott Cleland ldquoGooglersquos PiracyLiabilityrdquo Forbes 9 November 2011 wwwforbescomsitesscottcle-land20111109googles-piracy-liability ltviewed 11 January 2013gt

28 ldquoGovernment of Canada Web Archiverdquo Library and Archives Canada 17October 2007 wwwcollectionscanadagccawebarchivesindex-ehtmlltviewed 11 January 2013gt

29 For background see Bill Curry ldquoVisiting Library and Archives inOttawa Not Without an Appointmentrdquo Globe and Mail 1 May 2012wwwtheglobeandmailcomnewspoliticsottawa-notebookvisiting-library-and-archives-in-ottawa-not-without-an-appointmentarticle2418960 ltviewed 22 May 2012gt and the announcement ldquoLAC Begins Implementation of New Approach to service Deliveryrdquo Library and Archives Canada February 2012wwwcollectionscanadagccawhats-new013-560-ehtml ltviewed 22May 2012gt For professional responses see the ldquoCHArsquos comment onNew Directions for Library and Archives Canadardquo CHA Website March2012 wwwcha-shccaenNews_39items19html ltviewed 19December 2012gt and ldquoSave Library amp Archives Canadardquo CanadianAssociation of University Teachers launched AprilMay 2012 wwwsavelibraryarchivesca ltviewed 19 December 2012gt

30 Ian Milligan ldquoThe Smokescreen of lsquoModernizationrsquo at Library andArchives Canadardquo ActiveHistoryca 22 May 2012wwwactivehistoryca201205the-smokescreen-of-modernization-at-library-and-archives-canada ltviewed 19 December 2012gt

31 ldquoSave LAC May 2012 Campaign Updaterdquo SaveLibraryArchivesca May2012 wwwsavelibraryarchivescaupdate-2012-05aspx ltviewed 19December 2012gt

32 The challenges and opportunities presented by such projects is extensively discussed in Daniel J Cohen ldquoThe Future of Preserving the Pastrdquo CRM The Journal of Heritage Stewardship 2 no 2

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

(Summer 2005) 6ndash19 also available freely online at the RoyRosenzweig Center for History and New Mediawwwchnmgmueduessays-on-history-new-mediaessaysessayid=39For more see ldquoSeptember 11 Digital Archive Saving the Histories of September 11 2001rdquo Centre for History and New Media andAmerican Social History Project Center for Media and Learningwww911digitalarchiveorg ltviewed 11 January 2013gt ldquoHurricaneDigital Memory Bank Collecting and Preserving the Stories of Katrina and Ritardquo Center for History and New Media wwwhurricanearchiveorg ltviewed 11 January 2013gt and ldquoOccupyArchive Archiving the Occupy Movements from 2011rdquo Centre forHistory and New Media wwwoccupyarchiveorg ltviewed 11 January2013gt

33 Daniel Cohen and Roy Rosenzweig Digital History A Guide to Gathering Preserving and Presenting the Past on the Web(Philadelphia Temple University Press 2005) chapter on ldquoPreservingDigital Historyrdquo The chapter is available fully online atwwwchnmgmuedudigitalhistorypreserving5php ltviewed 19December 2012gt

34 For an overview see ldquoPreserving Our Digital Heritage The NationalDigital Information Infrastructure and Preservation Program 2010 Report a Collaborative Initiative of the Library of CongressrdquoDigitalpreservationgov wwwdigitalpreservationgovmultimediadocumentsNDIIPP2010Report_Postpdf ltviewed 19 December 2012gt

35 Steve Meloan ldquoNo Way to Run a Culturerdquo Wired 13 February 1998available online via the Internet Archivewwwwebarchiveorgweb20000619001705httpwwwwiredcomnewsculture012841030100html ltviewed 19 December 2012gt

36 Some of this has been discussed in greater depth on my website See theseries beginning with Ian Milligan ldquoWARC Files A Challenge forHistorians and Finding Needles in Haystacksrdquo ianmilliganca 12December 2012 wwwianmilliganca20121212warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks ltviewed 20 December2012gt

37 Unstructured data as opposed to structured data is important for histo-rians but brings additional challenges Some projects have producedstructured data novels marked up using XML encoding for example soone can immediately see place names dates character names and soforth This makes it much easier for a computer to read and understand

38 See William Noblett ldquoDigitization A Cautionary Talerdquo New Review ofAcademic Librarianship 17 (2011) 3 Tobias Blanke Michael Bryantand Mark Hedges ldquoOcropodium Open Source OCR for Small-ScaleHistorical Archivesrdquo Journal of Information Science 38 no 1 (2012) 83

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

A two-pronged approach can take us to relevant information quickly. In the following example, I am carrying out a project on black history. We quickly search all of the CDC word frequency files for occurrences of the words "freedom," "black," and "liberty" (these terms, of course, require some background knowledge; the search is shaped by the user), obtain a list, and then map it onto a chart in order to quickly visualize their locations in the collection. In the graph above, for example, we see a visualization of our three search terms. As one moves the mouse over each bar, the frequency data for a given term appears. We can also zoom and enhance this graph to use it as we wish. A demonstration of this can be found online at www.ianmilligan.ca/2013/01/18/cdc-tool-1.
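To make this concrete, the search-and-chart step can be sketched in a few lines of Wolfram Language (the Mathematica environment used for this project's routines). This is a minimal sketch rather than the study's actual code: the folder name "cdc-texts," and the assumption that each CDC site has been reduced to a single plain-text file, are hypothetical.

    (* Count three search terms in every file of a hypothetical folder of
       plain-text CDC sites, then chart the counts across the collection. *)
    files = FileNames["*.txt", "cdc-texts"];
    terms = {"freedom", "black", "liberty"};
    counts = Table[
       With[{text = ToLowerCase[Import[f, "Text"]]},
        Table[StringCount[text, term], {term, terms}]],
       {f, files}];
    (* One group of bars per file; within each group, bars follow the
       order of terms, so clusters of hits stand out at a glance. *)
    BarChart[counts, ChartLegends -> terms]

Wrapping each count in Tooltip, or embedding the chart in Manipulate, would approximate the mouse-over and zooming behaviour described above.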

We can then move down into the data. In the example above, "047," labeled "blackgold.txt," contains 31 occurrences of the word "ontario." Moving to my Mathematica keyword-in-context routine, we can see where these occurrences of the word appear.
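That routine is not reproduced in full here, but a minimal keyword-in-context function in Wolfram Language might look like the following sketch. The five-word window, and the stand-alone file name echoing the example above, are illustrative assumptions.

    (* Keyword-in-context sketch: split a document into words and return
       a window of context around every occurrence of a term. *)
    kwic[file_String, term_String, window_Integer: 5] :=
     Module[{words, positions},
      words = StringSplit[ToLowerCase[Import[file, "Text"]]];
      positions = Flatten[Position[words, term]];
      Table[
       StringJoin[Riffle[
         words[[Max[1, p - window] ;; Min[Length[words], p + window]]],
         " "]],
       {p, positions}]]

    (* e.g., context for the 31 hits on "ontario" in the file above *)
    kwic["blackgold.txt", "ontario"]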


[Figure: A snapshot of the dynamic finding aid tool.]

We see that this website focuses mostly on southwestern Ontario, where many oil fields are located. If we move forward, we can see that it pertains to the nineteenth century and a particular region, as well as to subsequent memorialization activities (mention of a museum, for example). All this can be done from within the textual client, as opposed to going out into individual websites. Broad contours emerge from this distant reading, but they also facilitate deeper reading by allowing us to promptly identify critical sources.

Another model through which we can visualize documents is a viewer primarily focusing on term frequency and inverse document frequency, terms discussed below. To prepare this material, we combine our earlier data about word frequency within a single site (that the word "experience" appears 919 times, perhaps) with aggregate information about frequency in the entire corpus (i.e., that the word "experience" appears 1,766 times overall). The number of times that a term appears in a given website can be defined as tf (term frequency). Another opportunity is inverse document frequency, through which we can learn what words are most distinct and unique in a given document. The formula for this is the oft-cited tf-idf, a staple of information retrieval theory: tf multiplied by the logarithm of the total number of documents divided by the number of documents in which the given term appears. At a glance, then, we can see some of the unique words contained within a website if we select "inverse document frequency."
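As a worked illustration, suppose, hypothetically, that "experience" appeared in 1,200 of the collection's 73,000 documents. A sketch of the weighting, not the viewer's actual implementation, would then be:

    (* tf-idf: term frequency times the log of (documents in corpus /
       documents containing the term). Log here is the natural logarithm;
       a different base only rescales all weights uniformly. *)
    tfidf[tf_, df_, ndocs_] := tf*Log[ndocs/df]

    (* "experience": 919 hits in this site; the df of 1,200 is illustrative *)
    tfidf[919, 1200, 73000] // N  (* roughly 919 * 4.11, or about 3775 *)

High scores flag words that are frequent in this site but rare across the rest of the collection.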


At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program rather than quickly navigating between them. This stems from the massive number of files and the amount of data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule but can struggle with larger files.

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might


take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance rather than processed on demand, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need arises, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential of working with born-digital sources. It is intended to begin, or jumpstart, a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, or at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continuing education; for future historians, one of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools


can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. They should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and of attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before. How have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital repositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information; this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of a distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of the various ways to openly disseminate material on the web, whether through blogging, micro-blogging, or publishing webpages: it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with


comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history, and the readiness to work with born-digital sources, cannot stem from pedagogy alone, however. Institutional support is necessary. While the evidence is anecdotal, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for instance, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article has argued, first, why digital methodologies are necessary, and second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.


Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.genealogy.umontreal.ca/fr/leprdh.htm.


14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients – 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to teach the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook: Penguin, 2010. For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online: www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can, of course, vary depending on the compression methods chosen. Their math was based on an average of 8MB per book, if it was scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82. Published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Centre for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Centre for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in ProQuest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive. Available online: www.wayback.archive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996. Available online: www.web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collections Webpage, 24 November 2000, through Internet Archive, www.web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections Homepage," 21 November 2003, through Internet Archive, www.web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important News about Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive, www.web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46 It was 8 degrees Celsius at 5:00 p.m. on 18 December 1983; "Edith" was more popular than Emma until around 1910, and now "Emma" is the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.


48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.

51 "Important Notices," Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. They can usually be found on index or splash pages under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie file, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated late-2007 MacBook Pro.

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 33: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

We see that this website focuses mostly on southwestern Ontariowhere many oil fields are If we move forward we can see that it per-tains to the nineteenth century a particular region as well assubsequent memorialization activities (mention of a museum forexample) All this can be done from within the textual client asopposed to going out into individual websites Broad contoursemerge from this distant reading but they also facilitate deeper read-ing by allowing us to promptly realize critical sources

Another model through which we can visualize documents isthrough a viewer primarily focusing on term document frequencyand inverse frequency terms discussed below To prepare this mater-ial we are combining our earlier data about word frequency (that theword ldquoexperiencerdquo appears 919 times perhaps) with aggregate infor-mation about frequency in the entire corpus (ie the wordldquoexperiencerdquo appears 1766 times) The number of times that a termappears in a given website can be defined as tf (term frequency)Another opportunity is inverse document frequency where we canlearn what words are most distinct and unique in a given documentThe formula for this is the oft-cited tf-idf a staple of informationretrieval theory or tf times the logarithm of the number of docu-ments divided into the number of documents where the given termappears At a glance then we can see some of the unique words con-tained within a website if we select ldquoinverse document frequencyrdquo

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

At a very quick glance one can see the major words that make this aunique document This technique can be used on any textual docu-ment

Off-the-shelf tools can facilitate a deeper yet distant reading ofindividual websites as well especially the suite of research tools onwebsites such as Many Eyes and Voyant Using these tools howeverrequires that one puts individual data into each separate programrather than quickly navigating between them This stems from themassive amount of files and data involved Many Eyes limits files to5MB and Voyant has no hard and fast rule but can struggle withlarger files

This analysis was carried out on my personal computer55 Largerdata sets may necessitate a corresponding increase in computationalpower bringing social sciences and humanities researchers into con-tact with High Performance Computing (HPC) centres or consortiaThese extremely powerful computers provide researchers with com-puting power orders of magnitude above that of personal computersallowing results in hours instead of the weeks or months it might

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

take to process on a smaller system56 This has particular applicationin the basic processing of a very large web archive There are limita-tions however jobs must be submitted into queues in advancerather than on-site processing and perhaps most importantly HPCcentres prefer to run highly optimized code to minimize usage oftheir very expensive and scarcely allocated equipment As this level ofcoding expertise is often beyond humanities researchers researchteams must be formed

Humanities researchers would do well to familiarize themselveswith local HPC options many of which are more than happy towork on new and engaging questions These are accessible optionsin Ontario for example all but one university belongs to a HPCconsortium57 Yet as personal computing continues to dramaticallyincrease in speed and as storage prices continue to plummet textualanalysis is increasingly feasible on high-quality office systems An ini-tial exposure can generate questions and problems that if the needrears itself can subsequently be taken to the supercomputer

This case study is intended to provide one model for distantreading and the generation of finding aids in the interest of showingsome of the potential for working with born-digital sources It isintended to begin or jumpstart a conversation around the resourcesand skills necessary for the next generation A seemingly insur-mountable amount of information contained across over 73000files (as in the CDC collection) can be quickly navigated and iso-lated for researcher convenience

Conclusion Digital Literacy for the Next Generation ofHistorians

As a discipline I would like to suggest one path forward prepare thenext generation of historians at least some of them to deal with thedeluge of born-digital sources Fundamentally the challenge of born-digital sources is a pedagogical one For existing historians this is anissue of continual education for future historians an issue of profes-sional training (the annual Digital Humanities Summer Instituteheld at the University of Victoria each June is worth considering asan introduction to many digital concepts) Some off-the-shelf tools

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

can serve as gateways to deeper engagement notably the VoyantTools suite (wwwvoyant-toolsorg) and perhaps subsequently thelessons contained in the Programming Historian 2 That said digitalliteracy courses should become part of the essential pantheon of his-torical curricula This should not be mandatory but instead taughtalongside and within traditional historiography and research meth-ods courses In an era of diminishing resources and attention paid toknowledge translation and relevancy digital literacy courses are alsowell positioned to provide graduates with concrete portfolios andactionable skills If we want to be prepared to use digital method-ologies to study the post-1990 world we will need ready historians

Such a course should have six basic priorities First we must pro-vide historiographical context for new students on whatcomputational methods have been tried before How have quantita-tive and qualitative methods differed yet ultimately complementedeach other Second students in this course would need to develop anunderstanding of the digital basics cloud computing effectivebackup and versioning digital security and an understanding of howthe Internet is organized Third for effective cross-departmental andsustained engagement with digital history students must to learn howto digitally navigate conventional sources notable skill sets wouldinclude a familiarity with citation management software databasesand the possibility of using optical character recognition on a varietyof born-paper sources subsequently digitized through scanners orcameras This must also include a grasp of the scope and significanceof large digital depositories including Google Books the InternetArchive and others mentioned earlier Fourth students must learnhow to process large amounts of digital information this includes off-the-shelf visualization tools material discussed here and some of thephilosophical and methodological implications of a distance readingFifth students can learn basic programming through the ProgrammingHistorian Finally and perhaps most importantly students need anunderstanding of various ways to openly disseminate material on the web whether through blogging micro-blogging publishing webpages and so forth it is important to enable the great work that we do in the university to make its mark on the world Throughthis students will also come away from history departments with

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

comprehensive and marketable digital portfolios Such a coursewould focus more on homework than reading with special emphasisplaced on deliverable content such as websites interactive modelsand broadly implementable databases

Digital history and the readiness to work with born-digitalsources cannot stem out of pedagogy alone however Institutionalsupport is necessary While based on anecdotal evidence it appearsthat some university administrators have begun to embrace the shifttowards the digital humanities At the federal level SSHRC has iden-tified the Digital Economy as one of its priority areas58 Yet when itcomes to evaluation doctoral and post-doctoral researchers do nothave a ldquodigital humanitiesrdquo option open to them in terms of assessingcommittees leading to fears that their projects might be lost inbetween disciplinary areas of focus59 In light of the impending crisisof born-digital sources historians should encourage these initiatives

These sorts of digital projects do not always lend themselveswell to traditional publication venues The examples above from theCDC case study for example are best viewed as dynamically mal-leable models this will be one challenge as historians grapple withnew publication models This article argues first why digitalmethodologies are necessary and second how we can realize aspectsof this call The Internet is 20 years old and before historians realizeit we will be faced with the crisis of born-digital information over-load If history is to continue as the leading discipline inunderstanding the social and cultural past decisive movementtowards the digital humanities is necessary When the sea comes inwe must be ready to swim

IAN MILLIGAN is an assistant professor of Canadian history at theUniversity of Waterloo He is a founding co-editor of ActiveHistorycaan editor-at-large of the Programming Historian 2 and has written sev-eral articles in the fields of Canadian youth digital and labour history

IAN MILLIGAN est professeur adjoint en histoire canadienne agrave laUniversity of Waterloo Il est le fondateur et codirecteur du site

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Internet ActiveHistoryca ainsi qursquoun eacutediteur du site ProgrammingHistorian 2 Il a eacutecrit plusieurs articles au sujet de la jeunesse cana-dienne de la numeacuterisation et de lrsquohistoire des travailleurs

Endnotes

1 The question of when the past becomes a ldquohistoryrdquo to be professionallystudied is a tricky one Yet it is worth noting that the still-definitivework on the Baby Boom Generation appeared in 1996 with DougOwram Born at the Right Time A History of the Baby Boom Generationin Canada (Toronto University of Toronto Press 1996) This actuallyfollowed the earlier scholarly monograph Cyril Levitt Children ofPrivilege Student Revolt in the Sixties (Toronto University of TorontoPress 1984) and a host of other laudable works As academicdebatesworks on the 1960s now suggest the past becomes historysooner than we often expect

2 A critical point advanced in the field of literary criticism and theory byStephen Ramsay Reading Machines Towards an Algorithmic Criticism(Urbana University of Illinois Press 2011)

3 See ldquoIdleNoMore Timelinerdquo ongoingwwwidlenomoremakookcatimeline44 ltviewed 18 January 2013gt

4 These issues have been discussed elsewhere albeit in either a moreabridged form or with a focus on digitized historical sources A notableexception is Daniel J Cohen and Roy Rosenzweig Digital History AGuide to Gathering Preserving and Presenting the Past on the Web(Philadelphia University of Pennsylvania Press 2005) specifically theirchapter on ldquoCollecting History onlinerdquo Their focus is on the preservation-ist and archival challenges however Important notes are also found inDan Cohen ldquoHistory and the Second Decade of the Webrdquo RethinkingHistory 8 no 2 (June 2004) 293ndash301 also available online atwwwchnmgmueduessays-on-history-new-mediaessaysessayid=34

5 Figure taken from Jose Martinez ldquoTwitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Dayrdquo ComplexTech wwwcomplexcomtech201210twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day ltviewed 6 January 2013gtFurther more while a tweet itself is only 140 characters every tweetcomes with over 40 lines of metadata date information author biogra-phy and location information when the account was created the usersrsquofollowing and follower numbers country data and so forth This is aremarkable treasure trove of information for future historians but also apotential source of information overload See Sarah Perez ldquoThis is What

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

a Tweet Looks Likerdquo ReadWriteWeb (19 April 2010) wwwreadwritewebcomarchivesthis_is_what_a_tweet_looks_likephpltviewed 6 January 2013gt

6 This is further discussed in Ian Foster ldquoHow Computation ChangesResearchrdquo in Switching Codes Thinking Through Digital Technology inthe Humanities and the Arts eds Thomas Bartscherer and RoderickCoover (Chicago University of Chicago Press 2011) esp 22ndash4

7 Franco Moretti Graphs Maps Trees Abstract Models for Literary History(New York Verso 2005) 3ndash4

8 While not of prime focus here the longue dureacutee deserves definition Itstands opposite the history of events of instances covering instead avery long time span in an interdisciplinary social science framework In along essay Braudel noted that one can then see both structural crises ofeconomies and structures that constrain human society and develop-ment For more see Fernand Braudel ldquoHistory and the Social SciencesThe Longue Dureacuteerdquo in Fernand Braudel On History trans SarahMatthews (Chicago University of Chicago Press 1980) 27 The essaywas originally printed in the Annales ESC no 4 (OctoberndashDecember1958)

9 A great overview can be found at ldquoText Analysisrdquo Tooling Up for DigitalHumanities 2011 wwwtoolingupstanfordedupage_id=981 ltviewed19 December 2012gt

10 Jean-Baptiste Michel Yuan Kui Shen Aviva Presser Aiden Adrian VeresMatthew K Gray The Google Books Team Joseph P Pickett DaleHoiberg Dan Clancy Peter Norvig Jon Orwant Steven Pinker MartinA Nowak Erez Lieberman Aiden ldquoQuantitative Analysis of CultureUsing Millions of Digitized Booksrdquo Science 331 (14 January 2011)176ndash82 The tool is publicly accessible at ldquoGoogle Books NgramViewerrdquo Google Books 2010 wwwbooksgooglecomngrams ltviewed11 January 2013gt

11 Anthony Grafton ldquoLoneliness and Freedomrdquo Perspectives on History[online edition] (March 2011)wwwhistoriansorgperspectivesissues201111031103pre1cfmltviewed 11 January 2013gt

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden onAnthony Grafton ldquoLoneliness and Freedomrdquo Perspectives on History[online edition] (March 2011)wwwhistoriansorgperspectivesissues201111031103pre1cfmltviewed 11 January 2013gt

13 For more information see the Canadian Century Research InfrastructureProject website at wwwccriuottawacaCCRIHomehtml and the Programme de recherche en deacutemographie historique at wwwgeneologyumontrealcafrleprdhhtm

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

14 ldquoWith Criminal Intentrdquo CriminalIntentorg wwwcriminalintentorgltviewed 20 December 2012gt

15 For an overview see ldquoAward Recipients mdash 2011rdquo Digging into DataWebpage wwwdiggingintodataorgHomeAwardRecipients2011tabid185Defaultaspx ltviewed 20 December 2012gt

16 One can do no better than the completely free Programming Historianwhich uses entirely free and open-source programming tools to learn thebasics of computational history See The Programming Historian 2 2012-Present wwwprogramminghistorianorg ltviewed 11 January 2013gt

17 Michael Katz The People of Hamilton Canada West Family and Class ina Mid-Nineteenth-Century City (Cambridge MA Harvard UniversityPress 1975) Also see A Gordon Darroch and Michael D OrnsteinldquoEthnicity and Occupational Structure in Canada in 1871 The VerticalMosaic in Historical Perspectiverdquo Canadian Historical Review 61 no 3(1980) 305ndash33

18 Ian Anderson ldquoHistory and Computingrdquo Making History The ChangingFace of the Profession in Britain 2008 wwwhistoryacukmakinghistoryresourcesarticleshistory_and_computinghtml ltviewed 20 December2012gt

19 Christopher Collins ldquoText and Document Visualizationrdquo TheHumanities and Technology Camp Greater Toronto Area VisualizationBootcamp University of Ontario Institute of Technology Oshawa Ont21 October 2011

20 The quotations by Luther and Poe are commonly used to make thispoint For the best example see Clay Shirky Cognitive SurplusCreativity and Generosity in a Connected Age Google eBook Penguin2010 For Mumford see Lewis Mumford The Myth of the Machine vol2 The Pentagon of Power (New York Harcourt Brace 1970) 182 asquoted in Gleick The Information A History A Theory A Flood (NewYork Pantheon 2011) 404

21 Adam Ostrow ldquoCisco CRS-3 Download Library of Congress in JustOver One Secondrdquo Mashable (9 March 2010) accessed onlinewwwmashablecom20100309cisco-crs-3

22 A note on data size is necessary The basic unit of information is a ldquobyterdquo (8 bits) which is roughly a single character A kilobyte (KB) is athousand bytes a megabyte (MB) a million a gigabyte (GB) a billion aterabyte (TB) a trillion and a petabyte (PB) a quadrillion Beyond thisthere are exabytes (EB) zettabytes (ZB) and yottabytes (YB)

23 These figures can of course vary depending on compression methodschosen Their math was based on an average of 8MB per book if it wasscanned at 600dpi and then compressed There are 26 million books inthe Library of Congress For more see Mike Ashenfelder ldquoTransferringlsquoLibraries of Congressrsquo of Datardquo Library of Congress Digital Preservation

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

Blog 11 July 2011wwwblogslocgovdigitalpreservation201107transferring-libraries-of-congress-of-data ltviewed 11 January 2013gt

24 Gleick 232 25 ldquoAbout the Library of Congress Web Archivesrdquo Library of Congress Web

Archiving FAQs Undated wwwlocgovwebarchivingfaqhtmlfaqs_02ltviewed 11 January 2013gt

26 ldquo10000000000000000 Bytes Archivedrdquo Internet Archive Blog 26October 2012 wwwblogarchiveorg2012102610000000000000000-bytes-archivedltviewed 19 December 2012gt

27 Jean-Baptiste Michel et al 176ndash82 Published Online Ahead of Print16 December 2010 wwwsciencemagorgcontent3316014176ltviewed 29 July 2011gt See also Scott Cleland ldquoGooglersquos PiracyLiabilityrdquo Forbes 9 November 2011 wwwforbescomsitesscottcle-land20111109googles-piracy-liability ltviewed 11 January 2013gt

28 ldquoGovernment of Canada Web Archiverdquo Library and Archives Canada 17October 2007 wwwcollectionscanadagccawebarchivesindex-ehtmlltviewed 11 January 2013gt

29 For background see Bill Curry ldquoVisiting Library and Archives inOttawa Not Without an Appointmentrdquo Globe and Mail 1 May 2012wwwtheglobeandmailcomnewspoliticsottawa-notebookvisiting-library-and-archives-in-ottawa-not-without-an-appointmentarticle2418960 ltviewed 22 May 2012gt and the announcement ldquoLAC Begins Implementation of New Approach to service Deliveryrdquo Library and Archives Canada February 2012wwwcollectionscanadagccawhats-new013-560-ehtml ltviewed 22May 2012gt For professional responses see the ldquoCHArsquos comment onNew Directions for Library and Archives Canadardquo CHA Website March2012 wwwcha-shccaenNews_39items19html ltviewed 19December 2012gt and ldquoSave Library amp Archives Canadardquo CanadianAssociation of University Teachers launched AprilMay 2012 wwwsavelibraryarchivesca ltviewed 19 December 2012gt

30 Ian Milligan ldquoThe Smokescreen of lsquoModernizationrsquo at Library andArchives Canadardquo ActiveHistoryca 22 May 2012wwwactivehistoryca201205the-smokescreen-of-modernization-at-library-and-archives-canada ltviewed 19 December 2012gt

31 ldquoSave LAC May 2012 Campaign Updaterdquo SaveLibraryArchivesca May2012 wwwsavelibraryarchivescaupdate-2012-05aspx ltviewed 19December 2012gt

32 The challenges and opportunities presented by such projects is extensively discussed in Daniel J Cohen ldquoThe Future of Preserving the Pastrdquo CRM The Journal of Heritage Stewardship 2 no 2

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

(Summer 2005) 6ndash19 also available freely online at the RoyRosenzweig Center for History and New Mediawwwchnmgmueduessays-on-history-new-mediaessaysessayid=39For more see ldquoSeptember 11 Digital Archive Saving the Histories of September 11 2001rdquo Centre for History and New Media andAmerican Social History Project Center for Media and Learningwww911digitalarchiveorg ltviewed 11 January 2013gt ldquoHurricaneDigital Memory Bank Collecting and Preserving the Stories of Katrina and Ritardquo Center for History and New Media wwwhurricanearchiveorg ltviewed 11 January 2013gt and ldquoOccupyArchive Archiving the Occupy Movements from 2011rdquo Centre forHistory and New Media wwwoccupyarchiveorg ltviewed 11 January2013gt

33 Daniel Cohen and Roy Rosenzweig Digital History A Guide to Gathering Preserving and Presenting the Past on the Web(Philadelphia Temple University Press 2005) chapter on ldquoPreservingDigital Historyrdquo The chapter is available fully online atwwwchnmgmuedudigitalhistorypreserving5php ltviewed 19December 2012gt

34 For an overview see ldquoPreserving Our Digital Heritage The NationalDigital Information Infrastructure and Preservation Program 2010 Report a Collaborative Initiative of the Library of CongressrdquoDigitalpreservationgov wwwdigitalpreservationgovmultimediadocumentsNDIIPP2010Report_Postpdf ltviewed 19 December 2012gt

35 Steve Meloan ldquoNo Way to Run a Culturerdquo Wired 13 February 1998available online via the Internet Archivewwwwebarchiveorgweb20000619001705httpwwwwiredcomnewsculture012841030100html ltviewed 19 December 2012gt

36 Some of this has been discussed in greater depth on my website See theseries beginning with Ian Milligan ldquoWARC Files A Challenge forHistorians and Finding Needles in Haystacksrdquo ianmilliganca 12December 2012 wwwianmilliganca20121212warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks ltviewed 20 December2012gt

At a very quick glance, one can see the major words that make this a unique document. This technique can be used on any textual document.
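To make the mechanics of that "unique words" view concrete, here is a minimal sketch of TF-IDF term weighting in Python, the language taught by the Programming Historian (the analysis in this article itself was carried out in Mathematica, as note 45 describes). The file names are hypothetical.

    import math
    import re
    from collections import Counter

    def tokenize(text):
        # Lowercase the document and split it into alphabetic word tokens.
        return re.findall(r"[a-z]+", text.lower())

    def top_distinctive_words(documents, doc_index, n=10):
        # Weight each word in one document by term frequency times inverse
        # document frequency (TF-IDF) across the whole corpus.
        tokenized = [tokenize(d) for d in documents]
        df = Counter()
        for doc in tokenized:
            df.update(set(doc))  # document frequency: one count per document
        tf = Counter(tokenized[doc_index])
        scores = {w: c * math.log(len(tokenized) / df[w]) for w, c in tf.items()}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

    # Hypothetical usage: which words set site1.txt apart from its neighbours?
    corpus = [open(name, encoding="utf-8").read()
              for name in ["site1.txt", "site2.txt", "site3.txt"]]
    print(top_distinctive_words(corpus, doc_index=0))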

Off-the-shelf tools can facilitate a deeper yet distant reading of individual websites as well, especially the suite of research tools on websites such as Many Eyes and Voyant. Using these tools, however, requires that one put individual data into each separate program rather than quickly navigating between them. This stems from the massive amount of files and data involved: Many Eyes limits files to 5MB, and Voyant has no hard and fast rule but can struggle with larger files.
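One pragmatic workaround, assuming a tool caps uploads at roughly 5MB, is simply to split an oversized text into tool-sized pieces before uploading. The sketch below is illustrative only; the cap and the file naming are assumptions, not any tool's documented behaviour.

    def split_for_upload(path, max_chars=5_000_000):
        # Split a large UTF-8 text file into numbered pieces. Character count
        # approximates byte count for mostly-ASCII English text.
        with open(path, encoding="utf-8") as f:
            text = f.read()
        pieces = []
        for i in range(0, len(text), max_chars):
            out = f"{path}.part{i // max_chars:03d}.txt"
            with open(out, "w", encoding="utf-8") as g:
                g.write(text[i:i + max_chars])
            pieces.append(out)
        return pieces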

This analysis was carried out on my personal computer.55 Larger data sets may necessitate a corresponding increase in computational power, bringing social sciences and humanities researchers into contact with High Performance Computing (HPC) centres or consortia. These extremely powerful computers provide researchers with computing power orders of magnitude above that of personal computers, allowing results in hours instead of the weeks or months it might take to process on a smaller system.56 This has particular application in the basic processing of a very large web archive. There are limitations, however: jobs must be submitted into queues in advance rather than processed on demand, and, perhaps most importantly, HPC centres prefer to run highly optimized code to minimize usage of their very expensive and scarcely allocated equipment. As this level of coding expertise is often beyond humanities researchers, research teams must be formed.
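Much basic processing can also be parallelized on an ordinary multi-core machine before any HPC consortium is approached. As a hedged illustration, the sketch below fans a simple word-count job out across local cores with Python's standard library; the directory layout is hypothetical.

    import glob
    import re
    from collections import Counter
    from concurrent.futures import ProcessPoolExecutor

    def count_words(path):
        # Count word tokens in a single plain-text file.
        with open(path, encoding="utf-8", errors="ignore") as f:
            return Counter(re.findall(r"[a-z]+", f.read().lower()))

    if __name__ == "__main__":
        totals = Counter()
        files = glob.glob("cdc_pages/*.txt")  # one file per archived page
        with ProcessPoolExecutor() as pool:   # one worker per CPU core
            for counts in pool.map(count_words, files, chunksize=50):
                totals.update(counts)
        print(totals.most_common(25))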

Humanities researchers would do well to familiarize themselves with local HPC options, many of which are more than happy to work on new and engaging questions. These are accessible options: in Ontario, for example, all but one university belongs to an HPC consortium.57 Yet as personal computing continues to dramatically increase in speed, and as storage prices continue to plummet, textual analysis is increasingly feasible on high-quality office systems. An initial exposure can generate questions and problems that, if the need arises, can subsequently be taken to the supercomputer.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential for working with born-digital sources. It is intended to begin or jumpstart a conversation around the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, contained across over 73,000 files (as in the CDC collection), can be quickly navigated and isolated for researcher convenience.
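What might such a machine-generated finding aid look like in practice? One minimal approach, sketched below with a hypothetical directory name, walks every text file in a collection and writes a sortable, searchable one-line summary of each to a CSV file.

    import csv
    import glob
    import os
    import re
    from collections import Counter

    def frequent_words(path, n=5):
        # The n most common longer words in a file, as a rough content hint.
        with open(path, encoding="utf-8", errors="ignore") as f:
            words = re.findall(r"[a-z]{4,}", f.read().lower())
        return " ".join(w for w, _ in Counter(words).most_common(n))

    with open("finding_aid.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "size_bytes", "frequent_words"])
        for path in sorted(glob.glob("cdc_collection/**/*.txt", recursive=True)):
            writer.writerow([path, os.path.getsize(path), frequent_words(path)])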

Conclusion: Digital Literacy for the Next Generation of Historians

As a discipline, I would like to suggest one path forward: prepare the next generation of historians, at least some of them, to deal with the deluge of born-digital sources. Fundamentally, the challenge of born-digital sources is a pedagogical one. For existing historians, this is an issue of continual education; for future historians, an issue of professional training (the annual Digital Humanities Summer Institute, held at the University of Victoria each June, is worth considering as an introduction to many digital concepts). Some off-the-shelf tools can serve as gateways to deeper engagement, notably the Voyant Tools suite (www.voyant-tools.org) and, perhaps subsequently, the lessons contained in the Programming Historian 2. That said, digital literacy courses should become part of the essential pantheon of historical curricula. These should not be mandatory, but instead taught alongside and within traditional historiography and research methods courses. In an era of diminishing resources and of attention paid to knowledge translation and relevancy, digital literacy courses are also well positioned to provide graduates with concrete portfolios and actionable skills. If we want to be prepared to use digital methodologies to study the post-1990 world, we will need ready historians.

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital depositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information: this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of the various ways to openly disseminate material on the web, whether through blogging, micro-blogging, publishing webpages, and so forth; it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios. Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history, and the readiness to work with born-digital sources, cannot stem from pedagogy alone, however. Institutional support is necessary. While the evidence is anecdotal, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost in between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article argues, first, why digital methodologies are necessary and, second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1. The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates/works on the 1960s now suggest, the past becomes history sooner than we often expect.

2. A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3. See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline/44 <viewed 18 January 2013>.

4. These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5. Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the users' following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.

6. This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7. Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8. While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9. A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10. Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11. Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12. Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13. For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.


14. "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15. For an overview, see "Award Recipients – 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16. One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to teach the basics of computational history. See The Programming Historian 2, 2012–present, www.programminghistorian.org <viewed 11 January 2013>.

17. Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18. Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19. Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp, Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20. The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age (Google eBook: Penguin, 2010). For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21. Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online: www.mashable.com/2010/03/09/cisco-crs-3.

22. A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this, there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23. These figures can, of course, vary depending on compression methods chosen. Their math was based on an average of 8MB per book, if it was scanned at 600dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.

24. Gleick, 232.

25. "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26. "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27. Jean-Baptiste Michel et al., 176–82. Published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28. "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29. For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>; and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>; and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30. Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31. "Save LAC: May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32. The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Centre for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Centre for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33. Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34. For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35. Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36. Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians, and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37. Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.

38. See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39. Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist/magazine/015002-2170-e.html <viewed 11 January 2013>.

40. "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive. Available online: www.wayback.archive.org/web/http://collections.ic.gc.ca <viewed 11 January 2013>.

41. "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996. Available online: www.web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42. Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collection Webpage, 24 November 2000, through Internet Archive, www.web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43. "Canada's Digital Collections homepage," 21 November 2003, through Internet Archive, www.web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44. "Important news about Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive, www.web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45. For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.

46. It was 8 degrees Celsius at 5:00 p.m. on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and "Emma" is now the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47. It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.


48. For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49. The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.
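For readers who want to see the shape of a topic-modelling workflow without installing MALLET, a minimal sketch of the same underlying idea (latent Dirichlet allocation) follows, using Python's scikit-learn library rather than MALLET itself; the corpus location and the choice of 20 topics are assumptions for illustration only.

    import glob
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [open(p, encoding="utf-8", errors="ignore").read()
            for p in glob.glob("corpus/*.txt")]  # hypothetical corpus layout

    # Bag-of-words counts, dropping English stop words and very rare terms.
    vectorizer = CountVectorizer(stop_words="english", min_df=2)
    counts = vectorizer.fit_transform(docs)

    # Fit an LDA model; the number of topics is an arbitrary starting point.
    lda = LatentDirichletAllocation(n_components=20, random_state=0)
    lda.fit(counts)

    # Show the ten highest-weighted words for each inferred topic.
    words = vectorizer.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        print(i, " ".join(words[j] for j in topic.argsort()[::-1][:10]))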

50. For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.
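In brief, for a word occurring a times in a corpus of c tokens and b times in a reference corpus of d tokens, the statistic compares observed with expected frequencies. A sketch in Python, following the commonly used Rayson–Garside formulation of the statistic rather than MONK's own code, with hypothetical counts in the example:

    import math

    def dunning_g2(a, b, c, d):
        # Expected counts if the word were spread evenly across both corpora.
        e1 = c * (a + b) / (c + d)
        e2 = d * (a + b) / (c + d)
        g2 = 0.0
        if a > 0:
            g2 += a * math.log(a / e1)
        if b > 0:
            g2 += b * math.log(b / e2)
        return 2 * g2

    # Example: 150 hits in a 1,000,000-token corpus versus 25 hits in an
    # 800,000-token reference corpus (all values hypothetical).
    print(dunning_g2(150, 25, 1_000_000, 800_000))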

51. "Important Notices," Library and Archives Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. These can usually be found on index or splash pages, under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52. The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53. For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54. David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55. The research was actually carried out on a rather dated late-2007 MacBook Pro.

56. See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>; and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57. "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58. "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59. Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 35: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

take to process on a smaller system56 This has particular applicationin the basic processing of a very large web archive There are limita-tions however jobs must be submitted into queues in advancerather than on-site processing and perhaps most importantly HPCcentres prefer to run highly optimized code to minimize usage oftheir very expensive and scarcely allocated equipment As this level ofcoding expertise is often beyond humanities researchers researchteams must be formed

Humanities researchers would do well to familiarize themselveswith local HPC options many of which are more than happy towork on new and engaging questions These are accessible optionsin Ontario for example all but one university belongs to a HPCconsortium57 Yet as personal computing continues to dramaticallyincrease in speed and as storage prices continue to plummet textualanalysis is increasingly feasible on high-quality office systems An ini-tial exposure can generate questions and problems that if the needrears itself can subsequently be taken to the supercomputer

This case study is intended to provide one model for distantreading and the generation of finding aids in the interest of showingsome of the potential for working with born-digital sources It isintended to begin or jumpstart a conversation around the resourcesand skills necessary for the next generation A seemingly insur-mountable amount of information contained across over 73000files (as in the CDC collection) can be quickly navigated and iso-lated for researcher convenience

Conclusion Digital Literacy for the Next Generation ofHistorians

As a discipline I would like to suggest one path forward prepare thenext generation of historians at least some of them to deal with thedeluge of born-digital sources Fundamentally the challenge of born-digital sources is a pedagogical one For existing historians this is anissue of continual education for future historians an issue of profes-sional training (the annual Digital Humanities Summer Instituteheld at the University of Victoria each June is worth considering asan introduction to many digital concepts) Some off-the-shelf tools

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

can serve as gateways to deeper engagement notably the VoyantTools suite (wwwvoyant-toolsorg) and perhaps subsequently thelessons contained in the Programming Historian 2 That said digitalliteracy courses should become part of the essential pantheon of his-torical curricula This should not be mandatory but instead taughtalongside and within traditional historiography and research meth-ods courses In an era of diminishing resources and attention paid toknowledge translation and relevancy digital literacy courses are alsowell positioned to provide graduates with concrete portfolios andactionable skills If we want to be prepared to use digital method-ologies to study the post-1990 world we will need ready historians

Such a course should have six basic priorities First we must pro-vide historiographical context for new students on whatcomputational methods have been tried before How have quantita-tive and qualitative methods differed yet ultimately complementedeach other Second students in this course would need to develop anunderstanding of the digital basics cloud computing effectivebackup and versioning digital security and an understanding of howthe Internet is organized Third for effective cross-departmental andsustained engagement with digital history students must to learn howto digitally navigate conventional sources notable skill sets wouldinclude a familiarity with citation management software databasesand the possibility of using optical character recognition on a varietyof born-paper sources subsequently digitized through scanners orcameras This must also include a grasp of the scope and significanceof large digital depositories including Google Books the InternetArchive and others mentioned earlier Fourth students must learnhow to process large amounts of digital information this includes off-the-shelf visualization tools material discussed here and some of thephilosophical and methodological implications of a distance readingFifth students can learn basic programming through the ProgrammingHistorian Finally and perhaps most importantly students need anunderstanding of various ways to openly disseminate material on the web whether through blogging micro-blogging publishing webpages and so forth it is important to enable the great work that we do in the university to make its mark on the world Throughthis students will also come away from history departments with

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

comprehensive and marketable digital portfolios Such a coursewould focus more on homework than reading with special emphasisplaced on deliverable content such as websites interactive modelsand broadly implementable databases

Digital history and the readiness to work with born-digitalsources cannot stem out of pedagogy alone however Institutionalsupport is necessary While based on anecdotal evidence it appearsthat some university administrators have begun to embrace the shifttowards the digital humanities At the federal level SSHRC has iden-tified the Digital Economy as one of its priority areas58 Yet when itcomes to evaluation doctoral and post-doctoral researchers do nothave a ldquodigital humanitiesrdquo option open to them in terms of assessingcommittees leading to fears that their projects might be lost inbetween disciplinary areas of focus59 In light of the impending crisisof born-digital sources historians should encourage these initiatives

These sorts of digital projects do not always lend themselveswell to traditional publication venues The examples above from theCDC case study for example are best viewed as dynamically mal-leable models this will be one challenge as historians grapple withnew publication models This article argues first why digitalmethodologies are necessary and second how we can realize aspectsof this call The Internet is 20 years old and before historians realizeit we will be faced with the crisis of born-digital information over-load If history is to continue as the leading discipline inunderstanding the social and cultural past decisive movementtowards the digital humanities is necessary When the sea comes inwe must be ready to swim

IAN MILLIGAN is an assistant professor of Canadian history at theUniversity of Waterloo He is a founding co-editor of ActiveHistorycaan editor-at-large of the Programming Historian 2 and has written sev-eral articles in the fields of Canadian youth digital and labour history

IAN MILLIGAN est professeur adjoint en histoire canadienne agrave laUniversity of Waterloo Il est le fondateur et codirecteur du site

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Internet ActiveHistoryca ainsi qursquoun eacutediteur du site ProgrammingHistorian 2 Il a eacutecrit plusieurs articles au sujet de la jeunesse cana-dienne de la numeacuterisation et de lrsquohistoire des travailleurs

Endnotes

1 The question of when the past becomes a ldquohistoryrdquo to be professionallystudied is a tricky one Yet it is worth noting that the still-definitivework on the Baby Boom Generation appeared in 1996 with DougOwram Born at the Right Time A History of the Baby Boom Generationin Canada (Toronto University of Toronto Press 1996) This actuallyfollowed the earlier scholarly monograph Cyril Levitt Children ofPrivilege Student Revolt in the Sixties (Toronto University of TorontoPress 1984) and a host of other laudable works As academicdebatesworks on the 1960s now suggest the past becomes historysooner than we often expect

2 A critical point advanced in the field of literary criticism and theory byStephen Ramsay Reading Machines Towards an Algorithmic Criticism(Urbana University of Illinois Press 2011)

3 See ldquoIdleNoMore Timelinerdquo ongoingwwwidlenomoremakookcatimeline44 ltviewed 18 January 2013gt

4 These issues have been discussed elsewhere albeit in either a moreabridged form or with a focus on digitized historical sources A notableexception is Daniel J Cohen and Roy Rosenzweig Digital History AGuide to Gathering Preserving and Presenting the Past on the Web(Philadelphia University of Pennsylvania Press 2005) specifically theirchapter on ldquoCollecting History onlinerdquo Their focus is on the preservation-ist and archival challenges however Important notes are also found inDan Cohen ldquoHistory and the Second Decade of the Webrdquo RethinkingHistory 8 no 2 (June 2004) 293ndash301 also available online atwwwchnmgmueduessays-on-history-new-mediaessaysessayid=34

5 Figure taken from Jose Martinez ldquoTwitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Dayrdquo ComplexTech wwwcomplexcomtech201210twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day ltviewed 6 January 2013gtFurther more while a tweet itself is only 140 characters every tweetcomes with over 40 lines of metadata date information author biogra-phy and location information when the account was created the usersrsquofollowing and follower numbers country data and so forth This is aremarkable treasure trove of information for future historians but also apotential source of information overload See Sarah Perez ldquoThis is What

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

a Tweet Looks Likerdquo ReadWriteWeb (19 April 2010) wwwreadwritewebcomarchivesthis_is_what_a_tweet_looks_likephpltviewed 6 January 2013gt

6 This is further discussed in Ian Foster ldquoHow Computation ChangesResearchrdquo in Switching Codes Thinking Through Digital Technology inthe Humanities and the Arts eds Thomas Bartscherer and RoderickCoover (Chicago University of Chicago Press 2011) esp 22ndash4

7 Franco Moretti Graphs Maps Trees Abstract Models for Literary History(New York Verso 2005) 3ndash4

8 While not of prime focus here the longue dureacutee deserves definition Itstands opposite the history of events of instances covering instead avery long time span in an interdisciplinary social science framework In along essay Braudel noted that one can then see both structural crises ofeconomies and structures that constrain human society and develop-ment For more see Fernand Braudel ldquoHistory and the Social SciencesThe Longue Dureacuteerdquo in Fernand Braudel On History trans SarahMatthews (Chicago University of Chicago Press 1980) 27 The essaywas originally printed in the Annales ESC no 4 (OctoberndashDecember1958)

9 A great overview can be found at ldquoText Analysisrdquo Tooling Up for DigitalHumanities 2011 wwwtoolingupstanfordedupage_id=981 ltviewed19 December 2012gt

10 Jean-Baptiste Michel Yuan Kui Shen Aviva Presser Aiden Adrian VeresMatthew K Gray The Google Books Team Joseph P Pickett DaleHoiberg Dan Clancy Peter Norvig Jon Orwant Steven Pinker MartinA Nowak Erez Lieberman Aiden ldquoQuantitative Analysis of CultureUsing Millions of Digitized Booksrdquo Science 331 (14 January 2011)176ndash82 The tool is publicly accessible at ldquoGoogle Books NgramViewerrdquo Google Books 2010 wwwbooksgooglecomngrams ltviewed11 January 2013gt

11 Anthony Grafton ldquoLoneliness and Freedomrdquo Perspectives on History[online edition] (March 2011)wwwhistoriansorgperspectivesissues201111031103pre1cfmltviewed 11 January 2013gt

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden onAnthony Grafton ldquoLoneliness and Freedomrdquo Perspectives on History[online edition] (March 2011)wwwhistoriansorgperspectivesissues201111031103pre1cfmltviewed 11 January 2013gt

13 For more information see the Canadian Century Research InfrastructureProject website at wwwccriuottawacaCCRIHomehtml and the Programme de recherche en deacutemographie historique at wwwgeneologyumontrealcafrleprdhhtm

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

14 ldquoWith Criminal Intentrdquo CriminalIntentorg wwwcriminalintentorgltviewed 20 December 2012gt

15 For an overview see ldquoAward Recipients mdash 2011rdquo Digging into DataWebpage wwwdiggingintodataorgHomeAwardRecipients2011tabid185Defaultaspx ltviewed 20 December 2012gt

16 One can do no better than the completely free Programming Historianwhich uses entirely free and open-source programming tools to learn thebasics of computational history See The Programming Historian 2 2012-Present wwwprogramminghistorianorg ltviewed 11 January 2013gt

17 Michael Katz The People of Hamilton Canada West Family and Class ina Mid-Nineteenth-Century City (Cambridge MA Harvard UniversityPress 1975) Also see A Gordon Darroch and Michael D OrnsteinldquoEthnicity and Occupational Structure in Canada in 1871 The VerticalMosaic in Historical Perspectiverdquo Canadian Historical Review 61 no 3(1980) 305ndash33

18 Ian Anderson ldquoHistory and Computingrdquo Making History The ChangingFace of the Profession in Britain 2008 wwwhistoryacukmakinghistoryresourcesarticleshistory_and_computinghtml ltviewed 20 December2012gt

19 Christopher Collins ldquoText and Document Visualizationrdquo TheHumanities and Technology Camp Greater Toronto Area VisualizationBootcamp University of Ontario Institute of Technology Oshawa Ont21 October 2011

20 The quotations by Luther and Poe are commonly used to make thispoint For the best example see Clay Shirky Cognitive SurplusCreativity and Generosity in a Connected Age Google eBook Penguin2010 For Mumford see Lewis Mumford The Myth of the Machine vol2 The Pentagon of Power (New York Harcourt Brace 1970) 182 asquoted in Gleick The Information A History A Theory A Flood (NewYork Pantheon 2011) 404

21 Adam Ostrow ldquoCisco CRS-3 Download Library of Congress in JustOver One Secondrdquo Mashable (9 March 2010) accessed onlinewwwmashablecom20100309cisco-crs-3

22 A note on data size is necessary The basic unit of information is a ldquobyterdquo (8 bits) which is roughly a single character A kilobyte (KB) is athousand bytes a megabyte (MB) a million a gigabyte (GB) a billion aterabyte (TB) a trillion and a petabyte (PB) a quadrillion Beyond thisthere are exabytes (EB) zettabytes (ZB) and yottabytes (YB)

23 These figures can of course vary depending on compression methodschosen Their math was based on an average of 8MB per book if it wasscanned at 600dpi and then compressed There are 26 million books inthe Library of Congress For more see Mike Ashenfelder ldquoTransferringlsquoLibraries of Congressrsquo of Datardquo Library of Congress Digital Preservation

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

Blog 11 July 2011wwwblogslocgovdigitalpreservation201107transferring-libraries-of-congress-of-data ltviewed 11 January 2013gt

24 Gleick 232 25 ldquoAbout the Library of Congress Web Archivesrdquo Library of Congress Web

Archiving FAQs Undated wwwlocgovwebarchivingfaqhtmlfaqs_02ltviewed 11 January 2013gt

26 ldquo10000000000000000 Bytes Archivedrdquo Internet Archive Blog 26October 2012 wwwblogarchiveorg2012102610000000000000000-bytes-archivedltviewed 19 December 2012gt

27 Jean-Baptiste Michel et al 176ndash82 Published Online Ahead of Print16 December 2010 wwwsciencemagorgcontent3316014176ltviewed 29 July 2011gt See also Scott Cleland ldquoGooglersquos PiracyLiabilityrdquo Forbes 9 November 2011 wwwforbescomsitesscottcle-land20111109googles-piracy-liability ltviewed 11 January 2013gt

28 ldquoGovernment of Canada Web Archiverdquo Library and Archives Canada 17October 2007 wwwcollectionscanadagccawebarchivesindex-ehtmlltviewed 11 January 2013gt

29 For background see Bill Curry ldquoVisiting Library and Archives inOttawa Not Without an Appointmentrdquo Globe and Mail 1 May 2012wwwtheglobeandmailcomnewspoliticsottawa-notebookvisiting-library-and-archives-in-ottawa-not-without-an-appointmentarticle2418960 ltviewed 22 May 2012gt and the announcement ldquoLAC Begins Implementation of New Approach to service Deliveryrdquo Library and Archives Canada February 2012wwwcollectionscanadagccawhats-new013-560-ehtml ltviewed 22May 2012gt For professional responses see the ldquoCHArsquos comment onNew Directions for Library and Archives Canadardquo CHA Website March2012 wwwcha-shccaenNews_39items19html ltviewed 19December 2012gt and ldquoSave Library amp Archives Canadardquo CanadianAssociation of University Teachers launched AprilMay 2012 wwwsavelibraryarchivesca ltviewed 19 December 2012gt

30 Ian Milligan ldquoThe Smokescreen of lsquoModernizationrsquo at Library andArchives Canadardquo ActiveHistoryca 22 May 2012wwwactivehistoryca201205the-smokescreen-of-modernization-at-library-and-archives-canada ltviewed 19 December 2012gt

31 ldquoSave LAC May 2012 Campaign Updaterdquo SaveLibraryArchivesca May2012 wwwsavelibraryarchivescaupdate-2012-05aspx ltviewed 19December 2012gt

32 The challenges and opportunities presented by such projects is extensively discussed in Daniel J Cohen ldquoThe Future of Preserving the Pastrdquo CRM The Journal of Heritage Stewardship 2 no 2

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

(Summer 2005) 6ndash19 also available freely online at the RoyRosenzweig Center for History and New Mediawwwchnmgmueduessays-on-history-new-mediaessaysessayid=39For more see ldquoSeptember 11 Digital Archive Saving the Histories of September 11 2001rdquo Centre for History and New Media andAmerican Social History Project Center for Media and Learningwww911digitalarchiveorg ltviewed 11 January 2013gt ldquoHurricaneDigital Memory Bank Collecting and Preserving the Stories of Katrina and Ritardquo Center for History and New Media wwwhurricanearchiveorg ltviewed 11 January 2013gt and ldquoOccupyArchive Archiving the Occupy Movements from 2011rdquo Centre forHistory and New Media wwwoccupyarchiveorg ltviewed 11 January2013gt

33 Daniel Cohen and Roy Rosenzweig Digital History A Guide to Gathering Preserving and Presenting the Past on the Web(Philadelphia Temple University Press 2005) chapter on ldquoPreservingDigital Historyrdquo The chapter is available fully online atwwwchnmgmuedudigitalhistorypreserving5php ltviewed 19December 2012gt

34 For an overview see ldquoPreserving Our Digital Heritage The NationalDigital Information Infrastructure and Preservation Program 2010 Report a Collaborative Initiative of the Library of CongressrdquoDigitalpreservationgov wwwdigitalpreservationgovmultimediadocumentsNDIIPP2010Report_Postpdf ltviewed 19 December 2012gt

35 Steve Meloan ldquoNo Way to Run a Culturerdquo Wired 13 February 1998available online via the Internet Archivewwwwebarchiveorgweb20000619001705httpwwwwiredcomnewsculture012841030100html ltviewed 19 December 2012gt

36 Some of this has been discussed in greater depth on my website See theseries beginning with Ian Milligan ldquoWARC Files A Challenge forHistorians and Finding Needles in Haystacksrdquo ianmilliganca 12December 2012 wwwianmilliganca20121212warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks ltviewed 20 December2012gt

37 Unstructured data as opposed to structured data is important for histo-rians but brings additional challenges Some projects have producedstructured data novels marked up using XML encoding for example soone can immediately see place names dates character names and soforth This makes it much easier for a computer to read and understand

38 See William Noblett ldquoDigitization A Cautionary Talerdquo New Review ofAcademic Librarianship 17 (2011) 3 Tobias Blanke Michael Bryantand Mark Hedges ldquoOcropodium Open Source OCR for Small-ScaleHistorical Archivesrdquo Journal of Information Science 38 no 1 (2012) 83

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

and Donald S Macqueen ldquoDeveloping Methods for Very-Large-ScaleSearches in Proquest Historical Newspapers Collections and InfotracThe Times Digital Archive The Case of Two Million versus TwoMillionsrdquo Journal of English Linguistics 32 no 2 (June 2004) 127

39 Elizabeth Krug ldquoCanadarsquos Digital Collections Sharing the CanadianIdentity on the Internetrdquo The Archivist [Library and Archives Canadapublication] April 2000 wwwcollectionscanadagccapublicationsarchivistmagazine015002-2170-ehtml ltviewed 11 January 2013gt

40 ldquoWayBackMachine Beta wwwcollectionsicgccardquo Internet ArchiveAvailable onlinewwwwaybackarchiveorgwebhttpcollectionsicgcca ltviewed 11January 2013gt

41 ldquoStreet Youth Go Onlinerdquo News Release from Industry Canada InternetArchive 15 November 1996 Available onlinewwwwebarchiveorgweb20001011131150httpinfoicgccacmbWelcomeicnsfffc979db07de58e6852564e400603639514a8d746a34803385256612004d90b3OpenDocument ltviewed 11 January 2013gt

42 Darlene Fichter ldquoProjects for Libraries and Archivesrdquo Canadarsquos DigitalCollection Webpage 24 November 2000 through Internet Archivewwwwebarchiveorgweb200012091312httpcollectionsicgccaEcdc_projhtm ltviewed 18 June 2012gt

43 ldquoCanadarsquos Digital Collections homepagerdquo 21 November 2003 through Internet Archive wwwwebarchiveorgweb20031121213939httpcollectionsicgccaEindexphpvo=15 ltviewed 18 June 2012gt

44 ldquoImportant news about Canadarsquos Digital Collectionsrdquo Canadarsquos DigitalCollections Webpage 18 November 2006gt through Internet Archivewwwwebarchiveorgweb20061118125255httpcollectionsicgccaEviewhtml ltviewed 18 June 2012gt

45 For more information see the information at ldquoWolfram Mathematica9rdquo Wolfram Research wwwwolframcommathematica ltviewed 11January 2013gt An example of work in Mathematica can be seen inWilliam J Turkelrsquos demonstration of ldquoTerm Weighting with TF-IDFrdquodrawing on Old Bailey proceedings It is available at wwwdemonstrationswolframcomTermWeightingWithTFIDFltviewed 11 January 2013gt

46 It was 8 degrees Celsius at 500 pm on 18 December 1983 ldquoEdithrdquowas more popular than Emma until around 1910 and now ldquoEmmardquo isthe third most popular name in the United States for new births in2010 the Canadian unemployment rate in March 1985 was 106 per-cent 25th highest in the world and the Canadian GDP in December1995 was $6039 billion per year 10th largest in the world

47 It was formerly known as ldquoVoyeur Toolsrdquo before a 2011 name change Itis available online at wwwvoyant-toolsorg

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

48 For more information see ldquoSoftware Environment for the Advancementof Scholarly Research (SEASR)rdquo wwwseasrorg ltviewed 18 June2012gt

49 The MALLET toolkit can also be run independently of SEASR For more on topic modeling specifically see ldquoTopic Modelingrdquo MALLET Machine Learning for Language Toolkitwwwmalletcsumassedutopicsphp ltviewed 18 June 2012gt

50 For a good overview see ldquoAnalytics Dunningrsquos Log-Likelihood StatisticrdquoMONK Metadata Offer New Knowledge wwwgautamlisillinoisedumonkmiddlewarepublicanalyticsdunningshtml ltviewed 18 June2012gt

51 ldquoImportant Noticesrdquo Library and Archives Website last updated 30 July2010 wwwcollectionscanadagccanoticesindex-ehtml ltviewed 18June 2012gt Historians will have to increasingly learn to navigate theTerms of Services (TOS) of born-digital source repositories They can beusually found on index or splash pages under headings as varied asldquoTerms of Servicerdquo ldquoPermissionsrdquo ldquoImportant Noticesrdquo or others

52 The command was wget mdashm mdash-cookies=on mdash-keep-session-cookies mdash-load-cookies=cookietxt mdash-referrer=httpepelac-bacgcca100205301iccdccensusdefaulthtm mdashrandom-wait mdashwait=1 mdashlimit-rate=30K httpepelac-bacgcca100205301iccdcEAlphabetasp This loadeda previous cookie load as there is a ldquosplash redirectrdquo page warning aboutdysfunctional archived content It also imposed a random wait andbandwidth cap to be a good netizen

53 For a discussion of this see Christopher D Manning PrabhakarRaghavan Hinrich Scuumltze An Introduction to Information Retrieval(Cambridge UK Cambridge University Press 2009) 27ndash8 Availableonline at wwwnlpstanfordeduIR-book

54 David M Blei ldquoIntroduction to Probabilistic Topic Modelsrdquowwwcsprincetonedu~bleipapersBlei2011pdf ltviewed 18 June2012gt

55 The research was actually carried out on a rather dated late 2007MacBook Pro

56 See John Bonnett ldquoHigh-Performance Computing An Agenda for theSocial Sciences and the Humanities in Canadardquo Digital Studies 1 no 2(2009) wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt and John Bonnett Geoffrey Rockwell and KyleKuchmey ldquoHigh Performance Computing in the arts and HumanitiesrdquoSHARCNET Shared Hierarchical Academic Research Computing Network2006 wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt

57 ldquoSHARCNET History of SHARCNETrdquo SHARCNET 2009

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

wwwsharcnetcamyabouthistory ltviewed 18 December 2012gt 58 ldquoPriority Areasrdquo SSHRC last modified 4 November 2011

wwwsshrc-crshgccafunding-financementprograms-programmespriority_areas-domaines_prioritairesindex-engaspx ltviewed 18 June2012gt

59 Eloquently discussed in Adam Crymble ldquoIs Digital Humanities a Fieldof Researchrdquo Thoughts on Digital amp Public History 11 August 2011wwwadamcrymbleblogspotcom201108is-digital-humanities-field-of-researchhtml ltviewed 18 June 2012gt

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 36: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

can serve as gateways to deeper engagement notably the VoyantTools suite (wwwvoyant-toolsorg) and perhaps subsequently thelessons contained in the Programming Historian 2 That said digitalliteracy courses should become part of the essential pantheon of his-torical curricula This should not be mandatory but instead taughtalongside and within traditional historiography and research meth-ods courses In an era of diminishing resources and attention paid toknowledge translation and relevancy digital literacy courses are alsowell positioned to provide graduates with concrete portfolios andactionable skills If we want to be prepared to use digital method-ologies to study the post-1990 world we will need ready historians

Such a course should have six basic priorities. First, we must provide historiographical context for new students on what computational methods have been tried before: how have quantitative and qualitative methods differed, yet ultimately complemented each other? Second, students in this course would need to develop an understanding of the digital basics: cloud computing, effective backup and versioning, digital security, and an understanding of how the Internet is organized. Third, for effective cross-departmental and sustained engagement with digital history, students must learn how to digitally navigate conventional sources; notable skill sets would include a familiarity with citation management software, databases, and the possibility of using optical character recognition on a variety of born-paper sources subsequently digitized through scanners or cameras. This must also include a grasp of the scope and significance of large digital repositories, including Google Books, the Internet Archive, and others mentioned earlier. Fourth, students must learn how to process large amounts of digital information; this includes off-the-shelf visualization tools, material discussed here, and some of the philosophical and methodological implications of distant reading. Fifth, students can learn basic programming through the Programming Historian. Finally, and perhaps most importantly, students need an understanding of the various ways to openly disseminate material on the web, whether through blogging, micro-blogging, or publishing webpages: it is important to enable the great work that we do in the university to make its mark on the world. Through this, students will also come away from history departments with comprehensive and marketable digital portfolios.

Such a course would focus more on homework than reading, with special emphasis placed on deliverable content such as websites, interactive models, and broadly implementable databases.

Digital history and the readiness to work with born-digital sources cannot stem from pedagogy alone, however. Institutional support is necessary. While the evidence is anecdotal, it appears that some university administrators have begun to embrace the shift towards the digital humanities. At the federal level, SSHRC has identified the Digital Economy as one of its priority areas.58 Yet when it comes to evaluation, doctoral and post-doctoral researchers do not have a "digital humanities" option open to them in terms of assessing committees, leading to fears that their projects might be lost between disciplinary areas of focus.59 In light of the impending crisis of born-digital sources, historians should encourage these initiatives.

These sorts of digital projects do not always lend themselves well to traditional publication venues. The examples above from the CDC case study, for example, are best viewed as dynamically malleable models; this will be one challenge as historians grapple with new publication models. This article has argued, first, why digital methodologies are necessary and, second, how we can realize aspects of this call. The Internet is 20 years old, and before historians realize it, we will be faced with the crisis of born-digital information overload. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital humanities is necessary. When the sea comes in, we must be ready to swim.

IAN MILLIGAN is an assistant professor of Canadian history at the University of Waterloo. He is a founding co-editor of ActiveHistory.ca, an editor-at-large of the Programming Historian 2, and has written several articles in the fields of Canadian youth, digital, and labour history.

IAN MILLIGAN est professeur adjoint en histoire canadienne à la University of Waterloo. Il est le fondateur et codirecteur du site Internet ActiveHistory.ca, ainsi qu'un éditeur du site Programming Historian 2. Il a écrit plusieurs articles au sujet de la jeunesse canadienne, de la numérisation et de l'histoire des travailleurs.

Endnotes

1 The question of when the past becomes a "history" to be professionally studied is a tricky one. Yet it is worth noting that the still-definitive work on the Baby Boom Generation appeared in 1996 with Doug Owram, Born at the Right Time: A History of the Baby Boom Generation in Canada (Toronto: University of Toronto Press, 1996). This actually followed the earlier scholarly monograph, Cyril Levitt, Children of Privilege: Student Revolt in the Sixties (Toronto: University of Toronto Press, 1984), and a host of other laudable works. As academic debates and works on the 1960s now suggest, the past becomes history sooner than we often expect.

2 A critical point advanced in the field of literary criticism and theory by Stephen Ramsay, Reading Machines: Towards an Algorithmic Criticism (Urbana: University of Illinois Press, 2011).

3 See "IdleNoMore Timeline," ongoing, www.idlenomore.makook.ca/timeline44 <viewed 18 January 2013>.

4 These issues have been discussed elsewhere, albeit in either a more abridged form or with a focus on digitized historical sources. A notable exception is Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), specifically their chapter on "Collecting History Online." Their focus is on the preservationist and archival challenges, however. Important notes are also found in Dan Cohen, "History and the Second Decade of the Web," Rethinking History 8, no. 2 (June 2004): 293–301, also available online at www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=34.

5 Figure taken from Jose Martinez, "Twitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Day," ComplexTech, www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day <viewed 6 January 2013>. Furthermore, while a tweet itself is only 140 characters, every tweet comes with over 40 lines of metadata: date information, author biography and location information, when the account was created, the user's following and follower numbers, country data, and so forth. This is a remarkable treasure trove of information for future historians, but also a potential source of information overload. See Sarah Perez, "This is What a Tweet Looks Like," ReadWriteWeb (19 April 2010), www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php <viewed 6 January 2013>.
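
As a rough sketch of what that metadata looks like (a hedged illustration: the field names below follow the public Twitter API conventions of the era, but this is an invented, partial payload, not a real tweet object):

    # Illustrative subset of the metadata attached to a single tweet.
    # Field names follow Twitter REST API v1.1 conventions; a real tweet
    # object carries several dozen more fields than shown here.
    tweet = {
        "created_at": "Mon Apr 19 14:22:01 +0000 2010",
        "text": "...",  # the 140 characters themselves
        "user": {
            "screen_name": "example_user",
            "description": "author biography",
            "location": "Waterloo, Ontario",
            "created_at": "Sat Mar 21 16:04:22 +0000 2009",
            "followers_count": 1024,
            "friends_count": 256,  # accounts the user follows
        },
        "place": {"country": "Canada"},
    }

Every one of these fields is potential evidence for a future historian and, multiplied across hundreds of millions of tweets a day, exactly the kind of overload the note describes.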

6 This is further discussed in Ian Foster, "How Computation Changes Research," in Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts, eds. Thomas Bartscherer and Roderick Coover (Chicago: University of Chicago Press, 2011), esp. 22–4.

7 Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (New York: Verso, 2005), 3–4.

8 While not of prime focus here, the longue durée deserves definition. It stands opposite the history of events, of instances, covering instead a very long time span in an interdisciplinary social science framework. In a long essay, Braudel noted that one can then see both structural crises of economies and structures that constrain human society and development. For more, see Fernand Braudel, "History and the Social Sciences: The Longue Durée," in Fernand Braudel, On History, trans. Sarah Matthews (Chicago: University of Chicago Press, 1980), 27. The essay was originally printed in the Annales E.S.C., no. 4 (October–December 1958).

9 A great overview can be found at "Text Analysis," Tooling Up for Digital Humanities, 2011, www.toolingup.stanford.edu/?page_id=981 <viewed 19 December 2012>.

10 Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331 (14 January 2011): 176–82. The tool is publicly accessible at "Google Books Ngram Viewer," Google Books, 2010, www.books.google.com/ngrams <viewed 11 January 2013>.

11 Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, "Loneliness and Freedom," Perspectives on History [online edition] (March 2011), www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm <viewed 11 January 2013>.

13 For more information, see the Canadian Century Research Infrastructure Project website at www.ccri.uottawa.ca/CCRI/Home.html and the Programme de recherche en démographie historique at www.geneology.umontreal.ca/fr/leprdh.htm.

14 "With Criminal Intent," CriminalIntent.org, www.criminalintent.org <viewed 20 December 2012>.

15 For an overview, see "Award Recipients – 2011," Digging into Data Webpage, www.diggingintodata.org/Home/AwardRecipients2011/tabid/185/Default.aspx <viewed 20 December 2012>.

16 One can do no better than the completely free Programming Historian, which uses entirely free and open-source programming tools to teach the basics of computational history. See The Programming Historian 2, 2012-Present, www.programminghistorian.org <viewed 11 January 2013>.

17 Michael Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, MA: Harvard University Press, 1975). Also see A. Gordon Darroch and Michael D. Ornstein, "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective," Canadian Historical Review 61, no. 3 (1980): 305–33.

18 Ian Anderson, "History and Computing," Making History: The Changing Face of the Profession in Britain, 2008, www.history.ac.uk/makinghistory/resources/articles/history_and_computing.html <viewed 20 December 2012>.

19 Christopher Collins, "Text and Document Visualization," The Humanities and Technology Camp Greater Toronto Area Visualization Bootcamp, University of Ontario Institute of Technology, Oshawa, Ont., 21 October 2011.

20 The quotations by Luther and Poe are commonly used to make this point. For the best example, see Clay Shirky, Cognitive Surplus: Creativity and Generosity in a Connected Age, Google eBook, Penguin, 2010. For Mumford, see Lewis Mumford, The Myth of the Machine, vol. 2: The Pentagon of Power (New York: Harcourt Brace, 1970), 182, as quoted in Gleick, The Information: A History, A Theory, A Flood (New York: Pantheon, 2011), 404.

21 Adam Ostrow, "Cisco CRS-3: Download Library of Congress in Just Over One Second," Mashable (9 March 2010), accessed online, www.mashable.com/2010/03/09/cisco-crs-3.

22 A note on data size is necessary. The basic unit of information is a "byte" (8 bits), which is roughly a single character. A kilobyte (KB) is a thousand bytes; a megabyte (MB), a million; a gigabyte (GB), a billion; a terabyte (TB), a trillion; and a petabyte (PB), a quadrillion. Beyond this there are exabytes (EB), zettabytes (ZB), and yottabytes (YB).

23 These figures can, of course, vary depending on the compression methods chosen. Their math was based on an average of 8 MB per book, if it was scanned at 600 dpi and then compressed. There are 26 million books in the Library of Congress. For more, see Mike Ashenfelder, "Transferring 'Libraries of Congress' of Data," Library of Congress Digital Preservation Blog, 11 July 2011, www.blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data <viewed 11 January 2013>.
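
Those two figures make the estimate easy to check; a minimal sketch using only the note's own assumptions (8 MB per scanned book, 26 million books):

    # Back-of-the-envelope size of a scanned Library of Congress,
    # on the note's assumptions: 26 million books at roughly 8 MB each.
    books = 26_000_000
    mb_per_book = 8
    total_tb = books * mb_per_book / 1_000_000  # MB -> TB, decimal units
    print(f"about {total_tb:,.0f} TB")  # about 208 TB, i.e. roughly 0.2 PB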

24 Gleick, 232.

25 "About the Library of Congress Web Archives," Library of Congress Web Archiving FAQs, undated, www.loc.gov/webarchiving/faq.html#faqs_02 <viewed 11 January 2013>.

26 "10,000,000,000,000,000 Bytes Archived," Internet Archive Blog, 26 October 2012, www.blog.archive.org/2012/10/26/10000000000000000-bytes-archived <viewed 19 December 2012>.

27 Jean-Baptiste Michel et al., 176–82. Published online ahead of print, 16 December 2010, www.sciencemag.org/content/331/6014/176 <viewed 29 July 2011>. See also Scott Cleland, "Google's Piracy Liability," Forbes, 9 November 2011, www.forbes.com/sites/scottcleland/2011/11/09/googles-piracy-liability <viewed 11 January 2013>.

28 "Government of Canada Web Archive," Library and Archives Canada, 17 October 2007, www.collectionscanada.gc.ca/webarchives/index-e.html <viewed 11 January 2013>.

29 For background, see Bill Curry, "Visiting Library and Archives in Ottawa? Not Without an Appointment," Globe and Mail, 1 May 2012, www.theglobeandmail.com/news/politics/ottawa-notebook/visiting-library-and-archives-in-ottawa-not-without-an-appointment/article2418960 <viewed 22 May 2012>, and the announcement, "LAC Begins Implementation of New Approach to Service Delivery," Library and Archives Canada, February 2012, www.collectionscanada.gc.ca/whats-new/013-560-e.html <viewed 22 May 2012>. For professional responses, see the "CHA's Comment on New Directions for Library and Archives Canada," CHA Website, March 2012, www.cha-shc.ca/en/News_39/items/19.html <viewed 19 December 2012>, and "Save Library & Archives Canada," Canadian Association of University Teachers, launched April/May 2012, www.savelibraryarchives.ca <viewed 19 December 2012>.

30 Ian Milligan, "The Smokescreen of 'Modernization' at Library and Archives Canada," ActiveHistory.ca, 22 May 2012, www.activehistory.ca/2012/05/the-smokescreen-of-modernization-at-library-and-archives-canada <viewed 19 December 2012>.

31 "Save LAC: May 2012 Campaign Update," SaveLibraryArchives.ca, May 2012, www.savelibraryarchives.ca/update-2012-05.aspx <viewed 19 December 2012>.

32 The challenges and opportunities presented by such projects are extensively discussed in Daniel J. Cohen, "The Future of Preserving the Past," CRM: The Journal of Heritage Stewardship 2, no. 2 (Summer 2005): 6–19, also available freely online at the Roy Rosenzweig Center for History and New Media, www.chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=39. For more, see "September 11 Digital Archive: Saving the Histories of September 11, 2001," Center for History and New Media and American Social History Project/Center for Media and Learning, www.911digitalarchive.org <viewed 11 January 2013>; "Hurricane Digital Memory Bank: Collecting and Preserving the Stories of Katrina and Rita," Center for History and New Media, www.hurricanearchive.org <viewed 11 January 2013>; and "Occupy Archive: Archiving the Occupy Movements from 2011," Center for History and New Media, www.occupyarchive.org <viewed 11 January 2013>.

33 Daniel Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia: University of Pennsylvania Press, 2005), chapter on "Preserving Digital History." The chapter is available fully online at www.chnm.gmu.edu/digitalhistory/preserving/5.php <viewed 19 December 2012>.

34 For an overview, see "Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress," Digitalpreservation.gov, www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf <viewed 19 December 2012>.

35 Steve Meloan, "No Way to Run a Culture," Wired, 13 February 1998, available online via the Internet Archive, www.web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html <viewed 19 December 2012>.

36 Some of this has been discussed in greater depth on my website. See the series beginning with Ian Milligan, "WARC Files: A Challenge for Historians and Finding Needles in Haystacks," ianmilligan.ca, 12 December 2012, www.ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks <viewed 20 December 2012>.

37 Unstructured data, as opposed to structured data, is important for historians but brings additional challenges. Some projects have produced structured data: novels marked up using XML encoding, for example, so one can immediately see place names, dates, character names, and so forth. This makes it much easier for a computer to read and understand.
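
A minimal sketch of what such markup buys the researcher (the tags here are illustrative, loosely TEI-flavoured, and not drawn from any particular project):

    import xml.etree.ElementTree as ET

    # A hypothetical marked-up fragment of a novel.
    fragment = """
    <passage>
      <persName>Clarissa</persName> walked through
      <placeName>Westminster</placeName> on
      <date when="1923-06-13">a June morning</date>.
    </passage>"""

    root = ET.fromstring(fragment)
    # Because the structure is explicit, extraction is one line per category:
    people = [e.text for e in root.iter("persName")]
    places = [e.text for e in root.iter("placeName")]
    print(people, places)  # ['Clarissa'] ['Westminster']

With unstructured plain text, by contrast, the computer must first guess which strings are names, places, or dates.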

38 See William Noblett, "Digitization: A Cautionary Tale," New Review of Academic Librarianship 17 (2011): 3; Tobias Blanke, Michael Bryant, and Mark Hedges, "Ocropodium: Open Source OCR for Small-Scale Historical Archives," Journal of Information Science 38, no. 1 (2012): 83; and Donald S. Macqueen, "Developing Methods for Very-Large-Scale Searches in ProQuest Historical Newspapers Collections and Infotrac The Times Digital Archive: The Case of Two Million versus Two Millions," Journal of English Linguistics 32, no. 2 (June 2004): 127.

39 Elizabeth Krug, "Canada's Digital Collections: Sharing the Canadian Identity on the Internet," The Archivist [Library and Archives Canada publication], April 2000, www.collectionscanada.gc.ca/publications/archivist-magazine/015002-2170-e.html <viewed 11 January 2013>.

40 "WayBackMachine Beta: www.collections.ic.gc.ca," Internet Archive, available online, www.wayback.archive.org/web/*/http://collections.ic.gc.ca <viewed 11 January 2013>.

41 "Street Youth Go Online," News Release from Industry Canada, Internet Archive, 15 November 1996, available online, www.web.archive.org/web/20001011131150/http://info.ic.gc.ca/cmb/Welcomeic.nsf/ffc979db07de58e6852564e400603639/514a8d746a34803385256612004d90b3?OpenDocument <viewed 11 January 2013>.

42 Darlene Fichter, "Projects for Libraries and Archives," Canada's Digital Collection Webpage, 24 November 2000, through Internet Archive, www.web.archive.org/web/200012091312/http://collections.ic.gc.ca/E/cdc_proj.htm <viewed 18 June 2012>.

43 "Canada's Digital Collections Homepage," 21 November 2003, through Internet Archive, www.web.archive.org/web/20031121213939/http://collections.ic.gc.ca/E/index.php?vo=15 <viewed 18 June 2012>.

44 "Important News about Canada's Digital Collections," Canada's Digital Collections Webpage, 18 November 2006, through Internet Archive, www.web.archive.org/web/20061118125255/http://collections.ic.gc.ca/E/view.html <viewed 18 June 2012>.

45 For more information, see "Wolfram Mathematica 9," Wolfram Research, www.wolfram.com/mathematica <viewed 11 January 2013>. An example of work in Mathematica can be seen in William J. Turkel's demonstration of "Term Weighting with TF-IDF," drawing on Old Bailey proceedings. It is available at www.demonstrations.wolfram.com/TermWeightingWithTFIDF <viewed 11 January 2013>.
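
The intuition behind TF-IDF is compact enough to sketch. In this toy illustration (in Python, rather than the Mathematica of Turkel's demonstration, and with an invented corpus), a term's weight in a document rises with its frequency there and falls with the number of documents that contain it:

    import math

    # Toy corpus: each "document" is a list of tokens.
    docs = [
        "the trial of the prisoner".split(),
        "the prisoner stole a watch".split(),
        "a watch and a chain".split(),
    ]

    def tf_idf(term, doc, docs):
        tf = doc.count(term) / len(doc)
        df = sum(term in d for d in docs)  # documents containing the term
        idf = math.log(len(docs) / df)     # rarer terms score higher
        return tf * idf

    print(tf_idf("trial", docs[0], docs))  # ~0.22: rare, distinctive term
    print(tf_idf("the", docs[0], docs))    # ~0.16: frequent but widespread

Common words thus score below distinctive ones, which is what makes the weighting useful for surfacing a document's characteristic vocabulary.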

46 It was 8 degrees Celsius at 5:00 p.m. on 18 December 1983; "Edith" was more popular than "Emma" until around 1910, and "Emma" is now the third most popular name in the United States for new births in 2010; the Canadian unemployment rate in March 1985 was 10.6 percent, 25th highest in the world; and the Canadian GDP in December 1995 was $603.9 billion per year, 10th largest in the world.

47 It was formerly known as "Voyeur Tools" before a 2011 name change. It is available online at www.voyant-tools.org.

48 For more information, see "Software Environment for the Advancement of Scholarly Research (SEASR)," www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see "Topic Modeling," MALLET: Machine Learning for Language Toolkit, www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.
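
To give a sense of the shape of a topic-modeling run, here is a minimal sketch using the Python gensim library as a stand-in (an illustrative substitution only; the note's toolkit is MALLET, and real analyses use corpora orders of magnitude larger than this invented one):

    from gensim import corpora, models

    # Tiny stand-in corpus; a real run would use thousands of documents.
    texts = [
        "census population survey canada".split(),
        "census survey statistics population".split(),
        "hockey league goal season".split(),
        "hockey season team goal".split(),
    ]
    dictionary = corpora.Dictionary(texts)
    bow = [dictionary.doc2bow(t) for t in texts]

    # Fit a two-topic LDA model and print the top words per topic.
    lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
    for topic_id, words in lda.print_topics(num_words=4):
        print(topic_id, words)

MALLET's own command-line interface yields analogous output: for each inferred topic, a ranked list of its most heavily weighted words.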

50 For a good overview, see "Analytics: Dunning's Log-Likelihood Statistic," MONK: Metadata Offer New Knowledge, www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.
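
Dunning's statistic compares a word's frequency in a study corpus against a reference corpus and flags words that are over-represented beyond what chance would allow. A minimal sketch of the standard two-corpus form (the counts below are invented):

    import math

    def dunning_g2(count_a, total_a, count_b, total_b):
        """Dunning's log-likelihood (G2) for one word across two corpora."""
        combined = (count_a + count_b) / (total_a + total_b)
        expected_a = total_a * combined
        expected_b = total_b * combined
        g2 = 0.0
        for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
            if observed > 0:
                g2 += observed * math.log(observed / expected)
        return 2 * g2

    # 'digital' appears 150 times in a 10,000-word study corpus but only
    # 25 times in a 50,000-word reference corpus: strongly over-represented.
    print(round(dunning_g2(150, 10_000, 25, 50_000), 1))

Higher values indicate stronger evidence of over-representation relative to the reference corpus.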

51 "Important Notices," Library and Archives Canada Website, last updated 30 July 2010, www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. These can usually be found on index or splash pages, under headings as varied as "Terms of Service," "Permissions," "Important Notices," or others.

52 The command was:

    wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt \
         --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm \
         --random-wait --wait=1 --limit-rate=30K \
         http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp

This loaded a previously saved cookie file, as there is a "splash redirect" page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap, to be a good netizen.

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54 David M. Blei, "Introduction to Probabilistic Topic Models," www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated late-2007 MacBook Pro.

56 See John Bonnett, "High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada," Digital Studies 1, no. 2 (2009), www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>, and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, "High Performance Computing in the Arts and Humanities," SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006, www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 "SHARCNET: History of SHARCNET," SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 "Priority Areas," SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, "Is Digital Humanities a Field of Research?" Thoughts on Digital & Public History, 11 August 2011, www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 37: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

comprehensive and marketable digital portfolios Such a coursewould focus more on homework than reading with special emphasisplaced on deliverable content such as websites interactive modelsand broadly implementable databases

Digital history and the readiness to work with born-digitalsources cannot stem out of pedagogy alone however Institutionalsupport is necessary While based on anecdotal evidence it appearsthat some university administrators have begun to embrace the shifttowards the digital humanities At the federal level SSHRC has iden-tified the Digital Economy as one of its priority areas58 Yet when itcomes to evaluation doctoral and post-doctoral researchers do nothave a ldquodigital humanitiesrdquo option open to them in terms of assessingcommittees leading to fears that their projects might be lost inbetween disciplinary areas of focus59 In light of the impending crisisof born-digital sources historians should encourage these initiatives

These sorts of digital projects do not always lend themselveswell to traditional publication venues The examples above from theCDC case study for example are best viewed as dynamically mal-leable models this will be one challenge as historians grapple withnew publication models This article argues first why digitalmethodologies are necessary and second how we can realize aspectsof this call The Internet is 20 years old and before historians realizeit we will be faced with the crisis of born-digital information over-load If history is to continue as the leading discipline inunderstanding the social and cultural past decisive movementtowards the digital humanities is necessary When the sea comes inwe must be ready to swim

IAN MILLIGAN is an assistant professor of Canadian history at theUniversity of Waterloo He is a founding co-editor of ActiveHistorycaan editor-at-large of the Programming Historian 2 and has written sev-eral articles in the fields of Canadian youth digital and labour history

IAN MILLIGAN est professeur adjoint en histoire canadienne agrave laUniversity of Waterloo Il est le fondateur et codirecteur du site

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Internet ActiveHistoryca ainsi qursquoun eacutediteur du site ProgrammingHistorian 2 Il a eacutecrit plusieurs articles au sujet de la jeunesse cana-dienne de la numeacuterisation et de lrsquohistoire des travailleurs

Endnotes

1 The question of when the past becomes a ldquohistoryrdquo to be professionallystudied is a tricky one Yet it is worth noting that the still-definitivework on the Baby Boom Generation appeared in 1996 with DougOwram Born at the Right Time A History of the Baby Boom Generationin Canada (Toronto University of Toronto Press 1996) This actuallyfollowed the earlier scholarly monograph Cyril Levitt Children ofPrivilege Student Revolt in the Sixties (Toronto University of TorontoPress 1984) and a host of other laudable works As academicdebatesworks on the 1960s now suggest the past becomes historysooner than we often expect

2 A critical point advanced in the field of literary criticism and theory byStephen Ramsay Reading Machines Towards an Algorithmic Criticism(Urbana University of Illinois Press 2011)

3 See ldquoIdleNoMore Timelinerdquo ongoingwwwidlenomoremakookcatimeline44 ltviewed 18 January 2013gt

4 These issues have been discussed elsewhere albeit in either a moreabridged form or with a focus on digitized historical sources A notableexception is Daniel J Cohen and Roy Rosenzweig Digital History AGuide to Gathering Preserving and Presenting the Past on the Web(Philadelphia University of Pennsylvania Press 2005) specifically theirchapter on ldquoCollecting History onlinerdquo Their focus is on the preservation-ist and archival challenges however Important notes are also found inDan Cohen ldquoHistory and the Second Decade of the Webrdquo RethinkingHistory 8 no 2 (June 2004) 293ndash301 also available online atwwwchnmgmueduessays-on-history-new-mediaessaysessayid=34

5 Figure taken from Jose Martinez ldquoTwitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Dayrdquo ComplexTech wwwcomplexcomtech201210twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day ltviewed 6 January 2013gtFurther more while a tweet itself is only 140 characters every tweetcomes with over 40 lines of metadata date information author biogra-phy and location information when the account was created the usersrsquofollowing and follower numbers country data and so forth This is aremarkable treasure trove of information for future historians but also apotential source of information overload See Sarah Perez ldquoThis is What

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

a Tweet Looks Likerdquo ReadWriteWeb (19 April 2010) wwwreadwritewebcomarchivesthis_is_what_a_tweet_looks_likephpltviewed 6 January 2013gt

6 This is further discussed in Ian Foster ldquoHow Computation ChangesResearchrdquo in Switching Codes Thinking Through Digital Technology inthe Humanities and the Arts eds Thomas Bartscherer and RoderickCoover (Chicago University of Chicago Press 2011) esp 22ndash4

7 Franco Moretti Graphs Maps Trees Abstract Models for Literary History(New York Verso 2005) 3ndash4

8 While not of prime focus here the longue dureacutee deserves definition Itstands opposite the history of events of instances covering instead avery long time span in an interdisciplinary social science framework In along essay Braudel noted that one can then see both structural crises ofeconomies and structures that constrain human society and develop-ment For more see Fernand Braudel ldquoHistory and the Social SciencesThe Longue Dureacuteerdquo in Fernand Braudel On History trans SarahMatthews (Chicago University of Chicago Press 1980) 27 The essaywas originally printed in the Annales ESC no 4 (OctoberndashDecember1958)

9 A great overview can be found at ldquoText Analysisrdquo Tooling Up for DigitalHumanities 2011 wwwtoolingupstanfordedupage_id=981 ltviewed19 December 2012gt

10 Jean-Baptiste Michel Yuan Kui Shen Aviva Presser Aiden Adrian VeresMatthew K Gray The Google Books Team Joseph P Pickett DaleHoiberg Dan Clancy Peter Norvig Jon Orwant Steven Pinker MartinA Nowak Erez Lieberman Aiden ldquoQuantitative Analysis of CultureUsing Millions of Digitized Booksrdquo Science 331 (14 January 2011)176ndash82 The tool is publicly accessible at ldquoGoogle Books NgramViewerrdquo Google Books 2010 wwwbooksgooglecomngrams ltviewed11 January 2013gt

11 Anthony Grafton ldquoLoneliness and Freedomrdquo Perspectives on History[online edition] (March 2011)wwwhistoriansorgperspectivesissues201111031103pre1cfmltviewed 11 January 2013gt

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden onAnthony Grafton ldquoLoneliness and Freedomrdquo Perspectives on History[online edition] (March 2011)wwwhistoriansorgperspectivesissues201111031103pre1cfmltviewed 11 January 2013gt

13 For more information see the Canadian Century Research InfrastructureProject website at wwwccriuottawacaCCRIHomehtml and the Programme de recherche en deacutemographie historique at wwwgeneologyumontrealcafrleprdhhtm

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

14 ldquoWith Criminal Intentrdquo CriminalIntentorg wwwcriminalintentorgltviewed 20 December 2012gt

15 For an overview see ldquoAward Recipients mdash 2011rdquo Digging into DataWebpage wwwdiggingintodataorgHomeAwardRecipients2011tabid185Defaultaspx ltviewed 20 December 2012gt

16 One can do no better than the completely free Programming Historianwhich uses entirely free and open-source programming tools to learn thebasics of computational history See The Programming Historian 2 2012-Present wwwprogramminghistorianorg ltviewed 11 January 2013gt

17 Michael Katz The People of Hamilton Canada West Family and Class ina Mid-Nineteenth-Century City (Cambridge MA Harvard UniversityPress 1975) Also see A Gordon Darroch and Michael D OrnsteinldquoEthnicity and Occupational Structure in Canada in 1871 The VerticalMosaic in Historical Perspectiverdquo Canadian Historical Review 61 no 3(1980) 305ndash33

18 Ian Anderson ldquoHistory and Computingrdquo Making History The ChangingFace of the Profession in Britain 2008 wwwhistoryacukmakinghistoryresourcesarticleshistory_and_computinghtml ltviewed 20 December2012gt

19 Christopher Collins ldquoText and Document Visualizationrdquo TheHumanities and Technology Camp Greater Toronto Area VisualizationBootcamp University of Ontario Institute of Technology Oshawa Ont21 October 2011

20 The quotations by Luther and Poe are commonly used to make thispoint For the best example see Clay Shirky Cognitive SurplusCreativity and Generosity in a Connected Age Google eBook Penguin2010 For Mumford see Lewis Mumford The Myth of the Machine vol2 The Pentagon of Power (New York Harcourt Brace 1970) 182 asquoted in Gleick The Information A History A Theory A Flood (NewYork Pantheon 2011) 404

21 Adam Ostrow ldquoCisco CRS-3 Download Library of Congress in JustOver One Secondrdquo Mashable (9 March 2010) accessed onlinewwwmashablecom20100309cisco-crs-3

22 A note on data size is necessary The basic unit of information is a ldquobyterdquo (8 bits) which is roughly a single character A kilobyte (KB) is athousand bytes a megabyte (MB) a million a gigabyte (GB) a billion aterabyte (TB) a trillion and a petabyte (PB) a quadrillion Beyond thisthere are exabytes (EB) zettabytes (ZB) and yottabytes (YB)

23 These figures can of course vary depending on compression methodschosen Their math was based on an average of 8MB per book if it wasscanned at 600dpi and then compressed There are 26 million books inthe Library of Congress For more see Mike Ashenfelder ldquoTransferringlsquoLibraries of Congressrsquo of Datardquo Library of Congress Digital Preservation

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

Blog 11 July 2011wwwblogslocgovdigitalpreservation201107transferring-libraries-of-congress-of-data ltviewed 11 January 2013gt

24 Gleick 232 25 ldquoAbout the Library of Congress Web Archivesrdquo Library of Congress Web

Archiving FAQs Undated wwwlocgovwebarchivingfaqhtmlfaqs_02ltviewed 11 January 2013gt

26 ldquo10000000000000000 Bytes Archivedrdquo Internet Archive Blog 26October 2012 wwwblogarchiveorg2012102610000000000000000-bytes-archivedltviewed 19 December 2012gt

27 Jean-Baptiste Michel et al 176ndash82 Published Online Ahead of Print16 December 2010 wwwsciencemagorgcontent3316014176ltviewed 29 July 2011gt See also Scott Cleland ldquoGooglersquos PiracyLiabilityrdquo Forbes 9 November 2011 wwwforbescomsitesscottcle-land20111109googles-piracy-liability ltviewed 11 January 2013gt

28 ldquoGovernment of Canada Web Archiverdquo Library and Archives Canada 17October 2007 wwwcollectionscanadagccawebarchivesindex-ehtmlltviewed 11 January 2013gt

29 For background see Bill Curry ldquoVisiting Library and Archives inOttawa Not Without an Appointmentrdquo Globe and Mail 1 May 2012wwwtheglobeandmailcomnewspoliticsottawa-notebookvisiting-library-and-archives-in-ottawa-not-without-an-appointmentarticle2418960 ltviewed 22 May 2012gt and the announcement ldquoLAC Begins Implementation of New Approach to service Deliveryrdquo Library and Archives Canada February 2012wwwcollectionscanadagccawhats-new013-560-ehtml ltviewed 22May 2012gt For professional responses see the ldquoCHArsquos comment onNew Directions for Library and Archives Canadardquo CHA Website March2012 wwwcha-shccaenNews_39items19html ltviewed 19December 2012gt and ldquoSave Library amp Archives Canadardquo CanadianAssociation of University Teachers launched AprilMay 2012 wwwsavelibraryarchivesca ltviewed 19 December 2012gt

30 Ian Milligan ldquoThe Smokescreen of lsquoModernizationrsquo at Library andArchives Canadardquo ActiveHistoryca 22 May 2012wwwactivehistoryca201205the-smokescreen-of-modernization-at-library-and-archives-canada ltviewed 19 December 2012gt

31 ldquoSave LAC May 2012 Campaign Updaterdquo SaveLibraryArchivesca May2012 wwwsavelibraryarchivescaupdate-2012-05aspx ltviewed 19December 2012gt

32 The challenges and opportunities presented by such projects is extensively discussed in Daniel J Cohen ldquoThe Future of Preserving the Pastrdquo CRM The Journal of Heritage Stewardship 2 no 2

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

(Summer 2005) 6ndash19 also available freely online at the RoyRosenzweig Center for History and New Mediawwwchnmgmueduessays-on-history-new-mediaessaysessayid=39For more see ldquoSeptember 11 Digital Archive Saving the Histories of September 11 2001rdquo Centre for History and New Media andAmerican Social History Project Center for Media and Learningwww911digitalarchiveorg ltviewed 11 January 2013gt ldquoHurricaneDigital Memory Bank Collecting and Preserving the Stories of Katrina and Ritardquo Center for History and New Media wwwhurricanearchiveorg ltviewed 11 January 2013gt and ldquoOccupyArchive Archiving the Occupy Movements from 2011rdquo Centre forHistory and New Media wwwoccupyarchiveorg ltviewed 11 January2013gt

33 Daniel Cohen and Roy Rosenzweig Digital History A Guide to Gathering Preserving and Presenting the Past on the Web(Philadelphia Temple University Press 2005) chapter on ldquoPreservingDigital Historyrdquo The chapter is available fully online atwwwchnmgmuedudigitalhistorypreserving5php ltviewed 19December 2012gt

34 For an overview see ldquoPreserving Our Digital Heritage The NationalDigital Information Infrastructure and Preservation Program 2010 Report a Collaborative Initiative of the Library of CongressrdquoDigitalpreservationgov wwwdigitalpreservationgovmultimediadocumentsNDIIPP2010Report_Postpdf ltviewed 19 December 2012gt

35 Steve Meloan ldquoNo Way to Run a Culturerdquo Wired 13 February 1998available online via the Internet Archivewwwwebarchiveorgweb20000619001705httpwwwwiredcomnewsculture012841030100html ltviewed 19 December 2012gt

36 Some of this has been discussed in greater depth on my website See theseries beginning with Ian Milligan ldquoWARC Files A Challenge forHistorians and Finding Needles in Haystacksrdquo ianmilliganca 12December 2012 wwwianmilliganca20121212warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks ltviewed 20 December2012gt

37 Unstructured data as opposed to structured data is important for histo-rians but brings additional challenges Some projects have producedstructured data novels marked up using XML encoding for example soone can immediately see place names dates character names and soforth This makes it much easier for a computer to read and understand

38 See William Noblett ldquoDigitization A Cautionary Talerdquo New Review ofAcademic Librarianship 17 (2011) 3 Tobias Blanke Michael Bryantand Mark Hedges ldquoOcropodium Open Source OCR for Small-ScaleHistorical Archivesrdquo Journal of Information Science 38 no 1 (2012) 83

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

and Donald S Macqueen ldquoDeveloping Methods for Very-Large-ScaleSearches in Proquest Historical Newspapers Collections and InfotracThe Times Digital Archive The Case of Two Million versus TwoMillionsrdquo Journal of English Linguistics 32 no 2 (June 2004) 127

39 Elizabeth Krug ldquoCanadarsquos Digital Collections Sharing the CanadianIdentity on the Internetrdquo The Archivist [Library and Archives Canadapublication] April 2000 wwwcollectionscanadagccapublicationsarchivistmagazine015002-2170-ehtml ltviewed 11 January 2013gt

40 ldquoWayBackMachine Beta wwwcollectionsicgccardquo Internet ArchiveAvailable onlinewwwwaybackarchiveorgwebhttpcollectionsicgcca ltviewed 11January 2013gt

41 ldquoStreet Youth Go Onlinerdquo News Release from Industry Canada InternetArchive 15 November 1996 Available onlinewwwwebarchiveorgweb20001011131150httpinfoicgccacmbWelcomeicnsfffc979db07de58e6852564e400603639514a8d746a34803385256612004d90b3OpenDocument ltviewed 11 January 2013gt

42 Darlene Fichter ldquoProjects for Libraries and Archivesrdquo Canadarsquos DigitalCollection Webpage 24 November 2000 through Internet Archivewwwwebarchiveorgweb200012091312httpcollectionsicgccaEcdc_projhtm ltviewed 18 June 2012gt

43 ldquoCanadarsquos Digital Collections homepagerdquo 21 November 2003 through Internet Archive wwwwebarchiveorgweb20031121213939httpcollectionsicgccaEindexphpvo=15 ltviewed 18 June 2012gt

44 ldquoImportant news about Canadarsquos Digital Collectionsrdquo Canadarsquos DigitalCollections Webpage 18 November 2006gt through Internet Archivewwwwebarchiveorgweb20061118125255httpcollectionsicgccaEviewhtml ltviewed 18 June 2012gt

45 For more information see the information at ldquoWolfram Mathematica9rdquo Wolfram Research wwwwolframcommathematica ltviewed 11January 2013gt An example of work in Mathematica can be seen inWilliam J Turkelrsquos demonstration of ldquoTerm Weighting with TF-IDFrdquodrawing on Old Bailey proceedings It is available at wwwdemonstrationswolframcomTermWeightingWithTFIDFltviewed 11 January 2013gt

46 It was 8 degrees Celsius at 500 pm on 18 December 1983 ldquoEdithrdquowas more popular than Emma until around 1910 and now ldquoEmmardquo isthe third most popular name in the United States for new births in2010 the Canadian unemployment rate in March 1985 was 106 per-cent 25th highest in the world and the Canadian GDP in December1995 was $6039 billion per year 10th largest in the world

47 It was formerly known as ldquoVoyeur Toolsrdquo before a 2011 name change Itis available online at wwwvoyant-toolsorg

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

48 For more information see ldquoSoftware Environment for the Advancementof Scholarly Research (SEASR)rdquo wwwseasrorg ltviewed 18 June2012gt

49 The MALLET toolkit can also be run independently of SEASR For more on topic modeling specifically see ldquoTopic Modelingrdquo MALLET Machine Learning for Language Toolkitwwwmalletcsumassedutopicsphp ltviewed 18 June 2012gt

50 For a good overview see ldquoAnalytics Dunningrsquos Log-Likelihood StatisticrdquoMONK Metadata Offer New Knowledge wwwgautamlisillinoisedumonkmiddlewarepublicanalyticsdunningshtml ltviewed 18 June2012gt

51 ldquoImportant Noticesrdquo Library and Archives Website last updated 30 July2010 wwwcollectionscanadagccanoticesindex-ehtml ltviewed 18June 2012gt Historians will have to increasingly learn to navigate theTerms of Services (TOS) of born-digital source repositories They can beusually found on index or splash pages under headings as varied asldquoTerms of Servicerdquo ldquoPermissionsrdquo ldquoImportant Noticesrdquo or others

52 The command was wget mdashm mdash-cookies=on mdash-keep-session-cookies mdash-load-cookies=cookietxt mdash-referrer=httpepelac-bacgcca100205301iccdccensusdefaulthtm mdashrandom-wait mdashwait=1 mdashlimit-rate=30K httpepelac-bacgcca100205301iccdcEAlphabetasp This loadeda previous cookie load as there is a ldquosplash redirectrdquo page warning aboutdysfunctional archived content It also imposed a random wait andbandwidth cap to be a good netizen

53 For a discussion of this see Christopher D Manning PrabhakarRaghavan Hinrich Scuumltze An Introduction to Information Retrieval(Cambridge UK Cambridge University Press 2009) 27ndash8 Availableonline at wwwnlpstanfordeduIR-book

54 David M Blei ldquoIntroduction to Probabilistic Topic Modelsrdquowwwcsprincetonedu~bleipapersBlei2011pdf ltviewed 18 June2012gt

55 The research was actually carried out on a rather dated late 2007MacBook Pro

56 See John Bonnett ldquoHigh-Performance Computing An Agenda for theSocial Sciences and the Humanities in Canadardquo Digital Studies 1 no 2(2009) wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt and John Bonnett Geoffrey Rockwell and KyleKuchmey ldquoHigh Performance Computing in the arts and HumanitiesrdquoSHARCNET Shared Hierarchical Academic Research Computing Network2006 wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt

57 ldquoSHARCNET History of SHARCNETrdquo SHARCNET 2009

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

wwwsharcnetcamyabouthistory ltviewed 18 December 2012gt 58 ldquoPriority Areasrdquo SSHRC last modified 4 November 2011

wwwsshrc-crshgccafunding-financementprograms-programmespriority_areas-domaines_prioritairesindex-engaspx ltviewed 18 June2012gt

59 Eloquently discussed in Adam Crymble ldquoIs Digital Humanities a Fieldof Researchrdquo Thoughts on Digital amp Public History 11 August 2011wwwadamcrymbleblogspotcom201108is-digital-humanities-field-of-researchhtml ltviewed 18 June 2012gt

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 38: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

Internet ActiveHistoryca ainsi qursquoun eacutediteur du site ProgrammingHistorian 2 Il a eacutecrit plusieurs articles au sujet de la jeunesse cana-dienne de la numeacuterisation et de lrsquohistoire des travailleurs

Endnotes

1 The question of when the past becomes a ldquohistoryrdquo to be professionallystudied is a tricky one Yet it is worth noting that the still-definitivework on the Baby Boom Generation appeared in 1996 with DougOwram Born at the Right Time A History of the Baby Boom Generationin Canada (Toronto University of Toronto Press 1996) This actuallyfollowed the earlier scholarly monograph Cyril Levitt Children ofPrivilege Student Revolt in the Sixties (Toronto University of TorontoPress 1984) and a host of other laudable works As academicdebatesworks on the 1960s now suggest the past becomes historysooner than we often expect

2 A critical point advanced in the field of literary criticism and theory byStephen Ramsay Reading Machines Towards an Algorithmic Criticism(Urbana University of Illinois Press 2011)

3 See ldquoIdleNoMore Timelinerdquo ongoingwwwidlenomoremakookcatimeline44 ltviewed 18 January 2013gt

4 These issues have been discussed elsewhere albeit in either a moreabridged form or with a focus on digitized historical sources A notableexception is Daniel J Cohen and Roy Rosenzweig Digital History AGuide to Gathering Preserving and Presenting the Past on the Web(Philadelphia University of Pennsylvania Press 2005) specifically theirchapter on ldquoCollecting History onlinerdquo Their focus is on the preservation-ist and archival challenges however Important notes are also found inDan Cohen ldquoHistory and the Second Decade of the Webrdquo RethinkingHistory 8 no 2 (June 2004) 293ndash301 also available online atwwwchnmgmueduessays-on-history-new-mediaessaysessayid=34

5 Figure taken from Jose Martinez ldquoTwitter CEO Dick Costolo Reveals Staggering Number of Tweets Per Dayrdquo ComplexTech wwwcomplexcomtech201210twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day ltviewed 6 January 2013gtFurther more while a tweet itself is only 140 characters every tweetcomes with over 40 lines of metadata date information author biogra-phy and location information when the account was created the usersrsquofollowing and follower numbers country data and so forth This is aremarkable treasure trove of information for future historians but also apotential source of information overload See Sarah Perez ldquoThis is What

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

a Tweet Looks Likerdquo ReadWriteWeb (19 April 2010) wwwreadwritewebcomarchivesthis_is_what_a_tweet_looks_likephpltviewed 6 January 2013gt

6 This is further discussed in Ian Foster ldquoHow Computation ChangesResearchrdquo in Switching Codes Thinking Through Digital Technology inthe Humanities and the Arts eds Thomas Bartscherer and RoderickCoover (Chicago University of Chicago Press 2011) esp 22ndash4

7 Franco Moretti Graphs Maps Trees Abstract Models for Literary History(New York Verso 2005) 3ndash4

8 While not of prime focus here the longue dureacutee deserves definition Itstands opposite the history of events of instances covering instead avery long time span in an interdisciplinary social science framework In along essay Braudel noted that one can then see both structural crises ofeconomies and structures that constrain human society and develop-ment For more see Fernand Braudel ldquoHistory and the Social SciencesThe Longue Dureacuteerdquo in Fernand Braudel On History trans SarahMatthews (Chicago University of Chicago Press 1980) 27 The essaywas originally printed in the Annales ESC no 4 (OctoberndashDecember1958)

9 A great overview can be found at ldquoText Analysisrdquo Tooling Up for DigitalHumanities 2011 wwwtoolingupstanfordedupage_id=981 ltviewed19 December 2012gt

10 Jean-Baptiste Michel Yuan Kui Shen Aviva Presser Aiden Adrian VeresMatthew K Gray The Google Books Team Joseph P Pickett DaleHoiberg Dan Clancy Peter Norvig Jon Orwant Steven Pinker MartinA Nowak Erez Lieberman Aiden ldquoQuantitative Analysis of CultureUsing Millions of Digitized Booksrdquo Science 331 (14 January 2011)176ndash82 The tool is publicly accessible at ldquoGoogle Books NgramViewerrdquo Google Books 2010 wwwbooksgooglecomngrams ltviewed11 January 2013gt

11 Anthony Grafton ldquoLoneliness and Freedomrdquo Perspectives on History[online edition] (March 2011)wwwhistoriansorgperspectivesissues201111031103pre1cfmltviewed 11 January 2013gt

12 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden onAnthony Grafton ldquoLoneliness and Freedomrdquo Perspectives on History[online edition] (March 2011)wwwhistoriansorgperspectivesissues201111031103pre1cfmltviewed 11 January 2013gt

13 For more information see the Canadian Century Research InfrastructureProject website at wwwccriuottawacaCCRIHomehtml and the Programme de recherche en deacutemographie historique at wwwgeneologyumontrealcafrleprdhhtm

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

14 ldquoWith Criminal Intentrdquo CriminalIntentorg wwwcriminalintentorgltviewed 20 December 2012gt

15 For an overview see ldquoAward Recipients mdash 2011rdquo Digging into DataWebpage wwwdiggingintodataorgHomeAwardRecipients2011tabid185Defaultaspx ltviewed 20 December 2012gt

16 One can do no better than the completely free Programming Historianwhich uses entirely free and open-source programming tools to learn thebasics of computational history See The Programming Historian 2 2012-Present wwwprogramminghistorianorg ltviewed 11 January 2013gt

17 Michael Katz The People of Hamilton Canada West Family and Class ina Mid-Nineteenth-Century City (Cambridge MA Harvard UniversityPress 1975) Also see A Gordon Darroch and Michael D OrnsteinldquoEthnicity and Occupational Structure in Canada in 1871 The VerticalMosaic in Historical Perspectiverdquo Canadian Historical Review 61 no 3(1980) 305ndash33

18 Ian Anderson ldquoHistory and Computingrdquo Making History The ChangingFace of the Profession in Britain 2008 wwwhistoryacukmakinghistoryresourcesarticleshistory_and_computinghtml ltviewed 20 December2012gt

19 Christopher Collins ldquoText and Document Visualizationrdquo TheHumanities and Technology Camp Greater Toronto Area VisualizationBootcamp University of Ontario Institute of Technology Oshawa Ont21 October 2011

20 The quotations by Luther and Poe are commonly used to make thispoint For the best example see Clay Shirky Cognitive SurplusCreativity and Generosity in a Connected Age Google eBook Penguin2010 For Mumford see Lewis Mumford The Myth of the Machine vol2 The Pentagon of Power (New York Harcourt Brace 1970) 182 asquoted in Gleick The Information A History A Theory A Flood (NewYork Pantheon 2011) 404

21 Adam Ostrow ldquoCisco CRS-3 Download Library of Congress in JustOver One Secondrdquo Mashable (9 March 2010) accessed onlinewwwmashablecom20100309cisco-crs-3

22 A note on data size is necessary The basic unit of information is a ldquobyterdquo (8 bits) which is roughly a single character A kilobyte (KB) is athousand bytes a megabyte (MB) a million a gigabyte (GB) a billion aterabyte (TB) a trillion and a petabyte (PB) a quadrillion Beyond thisthere are exabytes (EB) zettabytes (ZB) and yottabytes (YB)

23 These figures can of course vary depending on compression methodschosen Their math was based on an average of 8MB per book if it wasscanned at 600dpi and then compressed There are 26 million books inthe Library of Congress For more see Mike Ashenfelder ldquoTransferringlsquoLibraries of Congressrsquo of Datardquo Library of Congress Digital Preservation

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

Blog 11 July 2011wwwblogslocgovdigitalpreservation201107transferring-libraries-of-congress-of-data ltviewed 11 January 2013gt

24 Gleick 232 25 ldquoAbout the Library of Congress Web Archivesrdquo Library of Congress Web

Archiving FAQs Undated wwwlocgovwebarchivingfaqhtmlfaqs_02ltviewed 11 January 2013gt

26 ldquo10000000000000000 Bytes Archivedrdquo Internet Archive Blog 26October 2012 wwwblogarchiveorg2012102610000000000000000-bytes-archivedltviewed 19 December 2012gt

27 Jean-Baptiste Michel et al 176ndash82 Published Online Ahead of Print16 December 2010 wwwsciencemagorgcontent3316014176ltviewed 29 July 2011gt See also Scott Cleland ldquoGooglersquos PiracyLiabilityrdquo Forbes 9 November 2011 wwwforbescomsitesscottcle-land20111109googles-piracy-liability ltviewed 11 January 2013gt

28 ldquoGovernment of Canada Web Archiverdquo Library and Archives Canada 17October 2007 wwwcollectionscanadagccawebarchivesindex-ehtmlltviewed 11 January 2013gt

29 For background see Bill Curry ldquoVisiting Library and Archives inOttawa Not Without an Appointmentrdquo Globe and Mail 1 May 2012wwwtheglobeandmailcomnewspoliticsottawa-notebookvisiting-library-and-archives-in-ottawa-not-without-an-appointmentarticle2418960 ltviewed 22 May 2012gt and the announcement ldquoLAC Begins Implementation of New Approach to service Deliveryrdquo Library and Archives Canada February 2012wwwcollectionscanadagccawhats-new013-560-ehtml ltviewed 22May 2012gt For professional responses see the ldquoCHArsquos comment onNew Directions for Library and Archives Canadardquo CHA Website March2012 wwwcha-shccaenNews_39items19html ltviewed 19December 2012gt and ldquoSave Library amp Archives Canadardquo CanadianAssociation of University Teachers launched AprilMay 2012 wwwsavelibraryarchivesca ltviewed 19 December 2012gt

30 Ian Milligan ldquoThe Smokescreen of lsquoModernizationrsquo at Library andArchives Canadardquo ActiveHistoryca 22 May 2012wwwactivehistoryca201205the-smokescreen-of-modernization-at-library-and-archives-canada ltviewed 19 December 2012gt

31 ldquoSave LAC May 2012 Campaign Updaterdquo SaveLibraryArchivesca May2012 wwwsavelibraryarchivescaupdate-2012-05aspx ltviewed 19December 2012gt

32 The challenges and opportunities presented by such projects is extensively discussed in Daniel J Cohen ldquoThe Future of Preserving the Pastrdquo CRM The Journal of Heritage Stewardship 2 no 2

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

(Summer 2005) 6ndash19 also available freely online at the RoyRosenzweig Center for History and New Mediawwwchnmgmueduessays-on-history-new-mediaessaysessayid=39For more see ldquoSeptember 11 Digital Archive Saving the Histories of September 11 2001rdquo Centre for History and New Media andAmerican Social History Project Center for Media and Learningwww911digitalarchiveorg ltviewed 11 January 2013gt ldquoHurricaneDigital Memory Bank Collecting and Preserving the Stories of Katrina and Ritardquo Center for History and New Media wwwhurricanearchiveorg ltviewed 11 January 2013gt and ldquoOccupyArchive Archiving the Occupy Movements from 2011rdquo Centre forHistory and New Media wwwoccupyarchiveorg ltviewed 11 January2013gt

33 Daniel Cohen and Roy Rosenzweig Digital History A Guide to Gathering Preserving and Presenting the Past on the Web(Philadelphia Temple University Press 2005) chapter on ldquoPreservingDigital Historyrdquo The chapter is available fully online atwwwchnmgmuedudigitalhistorypreserving5php ltviewed 19December 2012gt

34 For an overview see ldquoPreserving Our Digital Heritage The NationalDigital Information Infrastructure and Preservation Program 2010 Report a Collaborative Initiative of the Library of CongressrdquoDigitalpreservationgov wwwdigitalpreservationgovmultimediadocumentsNDIIPP2010Report_Postpdf ltviewed 19 December 2012gt

35 Steve Meloan ldquoNo Way to Run a Culturerdquo Wired 13 February 1998available online via the Internet Archivewwwwebarchiveorgweb20000619001705httpwwwwiredcomnewsculture012841030100html ltviewed 19 December 2012gt

36 Some of this has been discussed in greater depth on my website See theseries beginning with Ian Milligan ldquoWARC Files A Challenge forHistorians and Finding Needles in Haystacksrdquo ianmilliganca 12December 2012 wwwianmilliganca20121212warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks ltviewed 20 December2012gt

37 Unstructured data as opposed to structured data is important for histo-rians but brings additional challenges Some projects have producedstructured data novels marked up using XML encoding for example soone can immediately see place names dates character names and soforth This makes it much easier for a computer to read and understand

38 See William Noblett ldquoDigitization A Cautionary Talerdquo New Review ofAcademic Librarianship 17 (2011) 3 Tobias Blanke Michael Bryantand Mark Hedges ldquoOcropodium Open Source OCR for Small-ScaleHistorical Archivesrdquo Journal of Information Science 38 no 1 (2012) 83

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

and Donald S Macqueen ldquoDeveloping Methods for Very-Large-ScaleSearches in Proquest Historical Newspapers Collections and InfotracThe Times Digital Archive The Case of Two Million versus TwoMillionsrdquo Journal of English Linguistics 32 no 2 (June 2004) 127

39 Elizabeth Krug ldquoCanadarsquos Digital Collections Sharing the CanadianIdentity on the Internetrdquo The Archivist [Library and Archives Canadapublication] April 2000 wwwcollectionscanadagccapublicationsarchivistmagazine015002-2170-ehtml ltviewed 11 January 2013gt

40 ldquoWayBackMachine Beta wwwcollectionsicgccardquo Internet ArchiveAvailable onlinewwwwaybackarchiveorgwebhttpcollectionsicgcca ltviewed 11January 2013gt

41 ldquoStreet Youth Go Onlinerdquo News Release from Industry Canada InternetArchive 15 November 1996 Available onlinewwwwebarchiveorgweb20001011131150httpinfoicgccacmbWelcomeicnsfffc979db07de58e6852564e400603639514a8d746a34803385256612004d90b3OpenDocument ltviewed 11 January 2013gt

42 Darlene Fichter ldquoProjects for Libraries and Archivesrdquo Canadarsquos DigitalCollection Webpage 24 November 2000 through Internet Archivewwwwebarchiveorgweb200012091312httpcollectionsicgccaEcdc_projhtm ltviewed 18 June 2012gt

43 ldquoCanadarsquos Digital Collections homepagerdquo 21 November 2003 through Internet Archive wwwwebarchiveorgweb20031121213939httpcollectionsicgccaEindexphpvo=15 ltviewed 18 June 2012gt

44 ldquoImportant news about Canadarsquos Digital Collectionsrdquo Canadarsquos DigitalCollections Webpage 18 November 2006gt through Internet Archivewwwwebarchiveorgweb20061118125255httpcollectionsicgccaEviewhtml ltviewed 18 June 2012gt

45 For more information see the information at ldquoWolfram Mathematica9rdquo Wolfram Research wwwwolframcommathematica ltviewed 11January 2013gt An example of work in Mathematica can be seen inWilliam J Turkelrsquos demonstration of ldquoTerm Weighting with TF-IDFrdquodrawing on Old Bailey proceedings It is available at wwwdemonstrationswolframcomTermWeightingWithTFIDFltviewed 11 January 2013gt

46 It was 8 degrees Celsius at 500 pm on 18 December 1983 ldquoEdithrdquowas more popular than Emma until around 1910 and now ldquoEmmardquo isthe third most popular name in the United States for new births in2010 the Canadian unemployment rate in March 1985 was 106 per-cent 25th highest in the world and the Canadian GDP in December1995 was $6039 billion per year 10th largest in the world

47 It was formerly known as ldquoVoyeur Toolsrdquo before a 2011 name change Itis available online at wwwvoyant-toolsorg

48 For more information, see “Software Environment for the Advancement of Scholarly Research (SEASR)”: www.seasr.org <viewed 18 June 2012>.

49 The MALLET toolkit can also be run independently of SEASR. For more on topic modeling specifically, see “Topic Modeling,” MALLET: Machine Learning for Language Toolkit: www.mallet.cs.umass.edu/topics.php <viewed 18 June 2012>.
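
For readers who want to try it, a typical command-line run of MALLET’s topic modeling looks like the following (the directory and file names are placeholders; the flags are from MALLET’s standard documentation):

    bin/mallet import-dir --input my_texts/ --output texts.mallet \
        --keep-sequence --remove-stopwords
    bin/mallet train-topics --input texts.mallet --num-topics 20 \
        --output-topic-keys topic_keys.txt --output-doc-topics doc_topics.txt

The first command converts a directory of plain-text files into MALLET’s internal format; the second fits twenty topics and writes out the top keywords for each topic and the topic mixture of each document.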

50 For a good overview, see “Analytics: Dunning’s Log-Likelihood Statistic,” MONK: Metadata Offer New Knowledge: www.gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html <viewed 18 June 2012>.
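
Dunning’s statistic asks whether a word occurs at genuinely different rates in two corpora. In the usual formulation (notation mine): if a word occurs a times in a corpus of c words and b times in a corpus of d words, the expected counts under a shared rate are E₁ = c(a+b)/(c+d) and E₂ = d(a+b)/(c+d), and the score is

    G² = 2 ( a·ln(a/E₁) + b·ln(b/E₂) ).

Large values of G² flag words that are disproportionately characteristic of one corpus, which makes the statistic useful for comparing, say, one decade’s documents against another’s.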

51 “Important Notices,” Library and Archives Website, last updated 30 July 2010: www.collectionscanada.gc.ca/notices/index-e.html <viewed 18 June 2012>. Historians will increasingly have to learn to navigate the Terms of Service (TOS) of born-digital source repositories. These can usually be found on index or splash pages under headings as varied as “Terms of Service,” “Permissions,” “Important Notices,” or others.

52 The command was: wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm --random-wait --wait=1 --limit-rate=30K http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp. This loaded a previously saved cookie file, as there is a “splash redirect” page warning about dysfunctional archived content. It also imposed a random wait and a bandwidth cap to be a good netizen.
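
Set out as a runnable shell command with the flags glossed (the comments are added here; the command itself is unchanged):

    # -m mirrors the site recursively; --load-cookies reuses cookies captured
    # from the splash page; --random-wait/--wait=1 pause roughly 0.5-1.5
    # seconds between requests; --limit-rate caps downloads at 30 KB/s.
    wget -m --cookies=on --keep-session-cookies --load-cookies=cookie.txt \
      --referer=http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/census/default.htm \
      --random-wait --wait=1 --limit-rate=30K \
      http://epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp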

53 For a discussion of this, see Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval (Cambridge, UK: Cambridge University Press, 2009), 27–8. Available online at www.nlp.stanford.edu/IR-book.

54 David M. Blei, “Introduction to Probabilistic Topic Models”: www.cs.princeton.edu/~blei/papers/Blei2011.pdf <viewed 18 June 2012>.

55 The research was actually carried out on a rather dated late-2007 MacBook Pro.

56 See John Bonnett, “High-Performance Computing: An Agenda for the Social Sciences and the Humanities in Canada,” Digital Studies 1, no. 2 (2009): www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>; and John Bonnett, Geoffrey Rockwell, and Kyle Kuchmey, “High Performance Computing in the Arts and Humanities,” SHARCNET: Shared Hierarchical Academic Research Computing Network, 2006: www.sharcnet.ca/Documents/HHPC/hpcdh.html <viewed 18 December 2012>.

57 “SHARCNET: History of SHARCNET,” SHARCNET, 2009: www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 “Priority Areas,” SSHRC, last modified 4 November 2011: www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, “Is Digital Humanities a Field of Research?” Thoughts on Digital & Public History, 11 August 2011: www.adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.

36 Some of this has been discussed in greater depth on my website See theseries beginning with Ian Milligan ldquoWARC Files A Challenge forHistorians and Finding Needles in Haystacksrdquo ianmilliganca 12December 2012 wwwianmilliganca20121212warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks ltviewed 20 December2012gt

37 Unstructured data as opposed to structured data is important for histo-rians but brings additional challenges Some projects have producedstructured data novels marked up using XML encoding for example soone can immediately see place names dates character names and soforth This makes it much easier for a computer to read and understand

38 See William Noblett ldquoDigitization A Cautionary Talerdquo New Review ofAcademic Librarianship 17 (2011) 3 Tobias Blanke Michael Bryantand Mark Hedges ldquoOcropodium Open Source OCR for Small-ScaleHistorical Archivesrdquo Journal of Information Science 38 no 1 (2012) 83

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

and Donald S Macqueen ldquoDeveloping Methods for Very-Large-ScaleSearches in Proquest Historical Newspapers Collections and InfotracThe Times Digital Archive The Case of Two Million versus TwoMillionsrdquo Journal of English Linguistics 32 no 2 (June 2004) 127

39 Elizabeth Krug ldquoCanadarsquos Digital Collections Sharing the CanadianIdentity on the Internetrdquo The Archivist [Library and Archives Canadapublication] April 2000 wwwcollectionscanadagccapublicationsarchivistmagazine015002-2170-ehtml ltviewed 11 January 2013gt

40 ldquoWayBackMachine Beta wwwcollectionsicgccardquo Internet ArchiveAvailable onlinewwwwaybackarchiveorgwebhttpcollectionsicgcca ltviewed 11January 2013gt

41 ldquoStreet Youth Go Onlinerdquo News Release from Industry Canada InternetArchive 15 November 1996 Available onlinewwwwebarchiveorgweb20001011131150httpinfoicgccacmbWelcomeicnsfffc979db07de58e6852564e400603639514a8d746a34803385256612004d90b3OpenDocument ltviewed 11 January 2013gt

42 Darlene Fichter ldquoProjects for Libraries and Archivesrdquo Canadarsquos DigitalCollection Webpage 24 November 2000 through Internet Archivewwwwebarchiveorgweb200012091312httpcollectionsicgccaEcdc_projhtm ltviewed 18 June 2012gt

43 ldquoCanadarsquos Digital Collections homepagerdquo 21 November 2003 through Internet Archive wwwwebarchiveorgweb20031121213939httpcollectionsicgccaEindexphpvo=15 ltviewed 18 June 2012gt

44 ldquoImportant news about Canadarsquos Digital Collectionsrdquo Canadarsquos DigitalCollections Webpage 18 November 2006gt through Internet Archivewwwwebarchiveorgweb20061118125255httpcollectionsicgccaEviewhtml ltviewed 18 June 2012gt

45 For more information see the information at ldquoWolfram Mathematica9rdquo Wolfram Research wwwwolframcommathematica ltviewed 11January 2013gt An example of work in Mathematica can be seen inWilliam J Turkelrsquos demonstration of ldquoTerm Weighting with TF-IDFrdquodrawing on Old Bailey proceedings It is available at wwwdemonstrationswolframcomTermWeightingWithTFIDFltviewed 11 January 2013gt

46 It was 8 degrees Celsius at 500 pm on 18 December 1983 ldquoEdithrdquowas more popular than Emma until around 1910 and now ldquoEmmardquo isthe third most popular name in the United States for new births in2010 the Canadian unemployment rate in March 1985 was 106 per-cent 25th highest in the world and the Canadian GDP in December1995 was $6039 billion per year 10th largest in the world

47 It was formerly known as ldquoVoyeur Toolsrdquo before a 2011 name change Itis available online at wwwvoyant-toolsorg

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

48 For more information see ldquoSoftware Environment for the Advancementof Scholarly Research (SEASR)rdquo wwwseasrorg ltviewed 18 June2012gt

49 The MALLET toolkit can also be run independently of SEASR For more on topic modeling specifically see ldquoTopic Modelingrdquo MALLET Machine Learning for Language Toolkitwwwmalletcsumassedutopicsphp ltviewed 18 June 2012gt

50 For a good overview see ldquoAnalytics Dunningrsquos Log-Likelihood StatisticrdquoMONK Metadata Offer New Knowledge wwwgautamlisillinoisedumonkmiddlewarepublicanalyticsdunningshtml ltviewed 18 June2012gt

51 ldquoImportant Noticesrdquo Library and Archives Website last updated 30 July2010 wwwcollectionscanadagccanoticesindex-ehtml ltviewed 18June 2012gt Historians will have to increasingly learn to navigate theTerms of Services (TOS) of born-digital source repositories They can beusually found on index or splash pages under headings as varied asldquoTerms of Servicerdquo ldquoPermissionsrdquo ldquoImportant Noticesrdquo or others

52 The command was wget mdashm mdash-cookies=on mdash-keep-session-cookies mdash-load-cookies=cookietxt mdash-referrer=httpepelac-bacgcca100205301iccdccensusdefaulthtm mdashrandom-wait mdashwait=1 mdashlimit-rate=30K httpepelac-bacgcca100205301iccdcEAlphabetasp This loadeda previous cookie load as there is a ldquosplash redirectrdquo page warning aboutdysfunctional archived content It also imposed a random wait andbandwidth cap to be a good netizen

53 For a discussion of this see Christopher D Manning PrabhakarRaghavan Hinrich Scuumltze An Introduction to Information Retrieval(Cambridge UK Cambridge University Press 2009) 27ndash8 Availableonline at wwwnlpstanfordeduIR-book

54 David M Blei ldquoIntroduction to Probabilistic Topic Modelsrdquowwwcsprincetonedu~bleipapersBlei2011pdf ltviewed 18 June2012gt

55 The research was actually carried out on a rather dated late 2007MacBook Pro

56 See John Bonnett ldquoHigh-Performance Computing An Agenda for theSocial Sciences and the Humanities in Canadardquo Digital Studies 1 no 2(2009) wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt and John Bonnett Geoffrey Rockwell and KyleKuchmey ldquoHigh Performance Computing in the arts and HumanitiesrdquoSHARCNET Shared Hierarchical Academic Research Computing Network2006 wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt

57 ldquoSHARCNET History of SHARCNETrdquo SHARCNET 2009

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

wwwsharcnetcamyabouthistory ltviewed 18 December 2012gt 58 ldquoPriority Areasrdquo SSHRC last modified 4 November 2011

wwwsshrc-crshgccafunding-financementprograms-programmespriority_areas-domaines_prioritairesindex-engaspx ltviewed 18 June2012gt

59 Eloquently discussed in Adam Crymble ldquoIs Digital Humanities a Fieldof Researchrdquo Thoughts on Digital amp Public History 11 August 2011wwwadamcrymbleblogspotcom201108is-digital-humanities-field-of-researchhtml ltviewed 18 June 2012gt

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 40: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

14 ldquoWith Criminal Intentrdquo CriminalIntentorg wwwcriminalintentorgltviewed 20 December 2012gt

15 For an overview see ldquoAward Recipients mdash 2011rdquo Digging into DataWebpage wwwdiggingintodataorgHomeAwardRecipients2011tabid185Defaultaspx ltviewed 20 December 2012gt

16 One can do no better than the completely free Programming Historianwhich uses entirely free and open-source programming tools to learn thebasics of computational history See The Programming Historian 2 2012-Present wwwprogramminghistorianorg ltviewed 11 January 2013gt

17 Michael Katz The People of Hamilton Canada West Family and Class ina Mid-Nineteenth-Century City (Cambridge MA Harvard UniversityPress 1975) Also see A Gordon Darroch and Michael D OrnsteinldquoEthnicity and Occupational Structure in Canada in 1871 The VerticalMosaic in Historical Perspectiverdquo Canadian Historical Review 61 no 3(1980) 305ndash33

18 Ian Anderson ldquoHistory and Computingrdquo Making History The ChangingFace of the Profession in Britain 2008 wwwhistoryacukmakinghistoryresourcesarticleshistory_and_computinghtml ltviewed 20 December2012gt

19 Christopher Collins ldquoText and Document Visualizationrdquo TheHumanities and Technology Camp Greater Toronto Area VisualizationBootcamp University of Ontario Institute of Technology Oshawa Ont21 October 2011

20 The quotations by Luther and Poe are commonly used to make thispoint For the best example see Clay Shirky Cognitive SurplusCreativity and Generosity in a Connected Age Google eBook Penguin2010 For Mumford see Lewis Mumford The Myth of the Machine vol2 The Pentagon of Power (New York Harcourt Brace 1970) 182 asquoted in Gleick The Information A History A Theory A Flood (NewYork Pantheon 2011) 404

21 Adam Ostrow ldquoCisco CRS-3 Download Library of Congress in JustOver One Secondrdquo Mashable (9 March 2010) accessed onlinewwwmashablecom20100309cisco-crs-3

22 A note on data size is necessary The basic unit of information is a ldquobyterdquo (8 bits) which is roughly a single character A kilobyte (KB) is athousand bytes a megabyte (MB) a million a gigabyte (GB) a billion aterabyte (TB) a trillion and a petabyte (PB) a quadrillion Beyond thisthere are exabytes (EB) zettabytes (ZB) and yottabytes (YB)

23 These figures can of course vary depending on compression methodschosen Their math was based on an average of 8MB per book if it wasscanned at 600dpi and then compressed There are 26 million books inthe Library of Congress For more see Mike Ashenfelder ldquoTransferringlsquoLibraries of Congressrsquo of Datardquo Library of Congress Digital Preservation

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

Blog 11 July 2011wwwblogslocgovdigitalpreservation201107transferring-libraries-of-congress-of-data ltviewed 11 January 2013gt

24 Gleick 232 25 ldquoAbout the Library of Congress Web Archivesrdquo Library of Congress Web

Archiving FAQs Undated wwwlocgovwebarchivingfaqhtmlfaqs_02ltviewed 11 January 2013gt

26 ldquo10000000000000000 Bytes Archivedrdquo Internet Archive Blog 26October 2012 wwwblogarchiveorg2012102610000000000000000-bytes-archivedltviewed 19 December 2012gt

27 Jean-Baptiste Michel et al 176ndash82 Published Online Ahead of Print16 December 2010 wwwsciencemagorgcontent3316014176ltviewed 29 July 2011gt See also Scott Cleland ldquoGooglersquos PiracyLiabilityrdquo Forbes 9 November 2011 wwwforbescomsitesscottcle-land20111109googles-piracy-liability ltviewed 11 January 2013gt

28 ldquoGovernment of Canada Web Archiverdquo Library and Archives Canada 17October 2007 wwwcollectionscanadagccawebarchivesindex-ehtmlltviewed 11 January 2013gt

29 For background see Bill Curry ldquoVisiting Library and Archives inOttawa Not Without an Appointmentrdquo Globe and Mail 1 May 2012wwwtheglobeandmailcomnewspoliticsottawa-notebookvisiting-library-and-archives-in-ottawa-not-without-an-appointmentarticle2418960 ltviewed 22 May 2012gt and the announcement ldquoLAC Begins Implementation of New Approach to service Deliveryrdquo Library and Archives Canada February 2012wwwcollectionscanadagccawhats-new013-560-ehtml ltviewed 22May 2012gt For professional responses see the ldquoCHArsquos comment onNew Directions for Library and Archives Canadardquo CHA Website March2012 wwwcha-shccaenNews_39items19html ltviewed 19December 2012gt and ldquoSave Library amp Archives Canadardquo CanadianAssociation of University Teachers launched AprilMay 2012 wwwsavelibraryarchivesca ltviewed 19 December 2012gt

30 Ian Milligan ldquoThe Smokescreen of lsquoModernizationrsquo at Library andArchives Canadardquo ActiveHistoryca 22 May 2012wwwactivehistoryca201205the-smokescreen-of-modernization-at-library-and-archives-canada ltviewed 19 December 2012gt

31 ldquoSave LAC May 2012 Campaign Updaterdquo SaveLibraryArchivesca May2012 wwwsavelibraryarchivescaupdate-2012-05aspx ltviewed 19December 2012gt

32 The challenges and opportunities presented by such projects is extensively discussed in Daniel J Cohen ldquoThe Future of Preserving the Pastrdquo CRM The Journal of Heritage Stewardship 2 no 2

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

(Summer 2005) 6ndash19 also available freely online at the RoyRosenzweig Center for History and New Mediawwwchnmgmueduessays-on-history-new-mediaessaysessayid=39For more see ldquoSeptember 11 Digital Archive Saving the Histories of September 11 2001rdquo Centre for History and New Media andAmerican Social History Project Center for Media and Learningwww911digitalarchiveorg ltviewed 11 January 2013gt ldquoHurricaneDigital Memory Bank Collecting and Preserving the Stories of Katrina and Ritardquo Center for History and New Media wwwhurricanearchiveorg ltviewed 11 January 2013gt and ldquoOccupyArchive Archiving the Occupy Movements from 2011rdquo Centre forHistory and New Media wwwoccupyarchiveorg ltviewed 11 January2013gt

33 Daniel Cohen and Roy Rosenzweig Digital History A Guide to Gathering Preserving and Presenting the Past on the Web(Philadelphia Temple University Press 2005) chapter on ldquoPreservingDigital Historyrdquo The chapter is available fully online atwwwchnmgmuedudigitalhistorypreserving5php ltviewed 19December 2012gt

34 For an overview see ldquoPreserving Our Digital Heritage The NationalDigital Information Infrastructure and Preservation Program 2010 Report a Collaborative Initiative of the Library of CongressrdquoDigitalpreservationgov wwwdigitalpreservationgovmultimediadocumentsNDIIPP2010Report_Postpdf ltviewed 19 December 2012gt

35 Steve Meloan ldquoNo Way to Run a Culturerdquo Wired 13 February 1998available online via the Internet Archivewwwwebarchiveorgweb20000619001705httpwwwwiredcomnewsculture012841030100html ltviewed 19 December 2012gt

36 Some of this has been discussed in greater depth on my website See theseries beginning with Ian Milligan ldquoWARC Files A Challenge forHistorians and Finding Needles in Haystacksrdquo ianmilliganca 12December 2012 wwwianmilliganca20121212warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks ltviewed 20 December2012gt

37 Unstructured data as opposed to structured data is important for histo-rians but brings additional challenges Some projects have producedstructured data novels marked up using XML encoding for example soone can immediately see place names dates character names and soforth This makes it much easier for a computer to read and understand

38 See William Noblett ldquoDigitization A Cautionary Talerdquo New Review ofAcademic Librarianship 17 (2011) 3 Tobias Blanke Michael Bryantand Mark Hedges ldquoOcropodium Open Source OCR for Small-ScaleHistorical Archivesrdquo Journal of Information Science 38 no 1 (2012) 83

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

and Donald S Macqueen ldquoDeveloping Methods for Very-Large-ScaleSearches in Proquest Historical Newspapers Collections and InfotracThe Times Digital Archive The Case of Two Million versus TwoMillionsrdquo Journal of English Linguistics 32 no 2 (June 2004) 127

39 Elizabeth Krug ldquoCanadarsquos Digital Collections Sharing the CanadianIdentity on the Internetrdquo The Archivist [Library and Archives Canadapublication] April 2000 wwwcollectionscanadagccapublicationsarchivistmagazine015002-2170-ehtml ltviewed 11 January 2013gt

40 ldquoWayBackMachine Beta wwwcollectionsicgccardquo Internet ArchiveAvailable onlinewwwwaybackarchiveorgwebhttpcollectionsicgcca ltviewed 11January 2013gt

41 ldquoStreet Youth Go Onlinerdquo News Release from Industry Canada InternetArchive 15 November 1996 Available onlinewwwwebarchiveorgweb20001011131150httpinfoicgccacmbWelcomeicnsfffc979db07de58e6852564e400603639514a8d746a34803385256612004d90b3OpenDocument ltviewed 11 January 2013gt

42 Darlene Fichter ldquoProjects for Libraries and Archivesrdquo Canadarsquos DigitalCollection Webpage 24 November 2000 through Internet Archivewwwwebarchiveorgweb200012091312httpcollectionsicgccaEcdc_projhtm ltviewed 18 June 2012gt

43 ldquoCanadarsquos Digital Collections homepagerdquo 21 November 2003 through Internet Archive wwwwebarchiveorgweb20031121213939httpcollectionsicgccaEindexphpvo=15 ltviewed 18 June 2012gt

44 ldquoImportant news about Canadarsquos Digital Collectionsrdquo Canadarsquos DigitalCollections Webpage 18 November 2006gt through Internet Archivewwwwebarchiveorgweb20061118125255httpcollectionsicgccaEviewhtml ltviewed 18 June 2012gt

45 For more information see the information at ldquoWolfram Mathematica9rdquo Wolfram Research wwwwolframcommathematica ltviewed 11January 2013gt An example of work in Mathematica can be seen inWilliam J Turkelrsquos demonstration of ldquoTerm Weighting with TF-IDFrdquodrawing on Old Bailey proceedings It is available at wwwdemonstrationswolframcomTermWeightingWithTFIDFltviewed 11 January 2013gt

46 It was 8 degrees Celsius at 500 pm on 18 December 1983 ldquoEdithrdquowas more popular than Emma until around 1910 and now ldquoEmmardquo isthe third most popular name in the United States for new births in2010 the Canadian unemployment rate in March 1985 was 106 per-cent 25th highest in the world and the Canadian GDP in December1995 was $6039 billion per year 10th largest in the world

47 It was formerly known as ldquoVoyeur Toolsrdquo before a 2011 name change Itis available online at wwwvoyant-toolsorg

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

48 For more information see ldquoSoftware Environment for the Advancementof Scholarly Research (SEASR)rdquo wwwseasrorg ltviewed 18 June2012gt

49 The MALLET toolkit can also be run independently of SEASR For more on topic modeling specifically see ldquoTopic Modelingrdquo MALLET Machine Learning for Language Toolkitwwwmalletcsumassedutopicsphp ltviewed 18 June 2012gt

50 For a good overview see ldquoAnalytics Dunningrsquos Log-Likelihood StatisticrdquoMONK Metadata Offer New Knowledge wwwgautamlisillinoisedumonkmiddlewarepublicanalyticsdunningshtml ltviewed 18 June2012gt

51 ldquoImportant Noticesrdquo Library and Archives Website last updated 30 July2010 wwwcollectionscanadagccanoticesindex-ehtml ltviewed 18June 2012gt Historians will have to increasingly learn to navigate theTerms of Services (TOS) of born-digital source repositories They can beusually found on index or splash pages under headings as varied asldquoTerms of Servicerdquo ldquoPermissionsrdquo ldquoImportant Noticesrdquo or others

52 The command was wget mdashm mdash-cookies=on mdash-keep-session-cookies mdash-load-cookies=cookietxt mdash-referrer=httpepelac-bacgcca100205301iccdccensusdefaulthtm mdashrandom-wait mdashwait=1 mdashlimit-rate=30K httpepelac-bacgcca100205301iccdcEAlphabetasp This loadeda previous cookie load as there is a ldquosplash redirectrdquo page warning aboutdysfunctional archived content It also imposed a random wait andbandwidth cap to be a good netizen

53 For a discussion of this see Christopher D Manning PrabhakarRaghavan Hinrich Scuumltze An Introduction to Information Retrieval(Cambridge UK Cambridge University Press 2009) 27ndash8 Availableonline at wwwnlpstanfordeduIR-book

54 David M Blei ldquoIntroduction to Probabilistic Topic Modelsrdquowwwcsprincetonedu~bleipapersBlei2011pdf ltviewed 18 June2012gt

55 The research was actually carried out on a rather dated late 2007MacBook Pro

56 See John Bonnett ldquoHigh-Performance Computing An Agenda for theSocial Sciences and the Humanities in Canadardquo Digital Studies 1 no 2(2009) wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt and John Bonnett Geoffrey Rockwell and KyleKuchmey ldquoHigh Performance Computing in the arts and HumanitiesrdquoSHARCNET Shared Hierarchical Academic Research Computing Network2006 wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt

57 ldquoSHARCNET History of SHARCNETrdquo SHARCNET 2009

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

wwwsharcnetcamyabouthistory ltviewed 18 December 2012gt 58 ldquoPriority Areasrdquo SSHRC last modified 4 November 2011

wwwsshrc-crshgccafunding-financementprograms-programmespriority_areas-domaines_prioritairesindex-engaspx ltviewed 18 June2012gt

59 Eloquently discussed in Adam Crymble ldquoIs Digital Humanities a Fieldof Researchrdquo Thoughts on Digital amp Public History 11 August 2011wwwadamcrymbleblogspotcom201108is-digital-humanities-field-of-researchhtml ltviewed 18 June 2012gt

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 41: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

Blog 11 July 2011wwwblogslocgovdigitalpreservation201107transferring-libraries-of-congress-of-data ltviewed 11 January 2013gt

24 Gleick 232 25 ldquoAbout the Library of Congress Web Archivesrdquo Library of Congress Web

Archiving FAQs Undated wwwlocgovwebarchivingfaqhtmlfaqs_02ltviewed 11 January 2013gt

26 ldquo10000000000000000 Bytes Archivedrdquo Internet Archive Blog 26October 2012 wwwblogarchiveorg2012102610000000000000000-bytes-archivedltviewed 19 December 2012gt

27 Jean-Baptiste Michel et al 176ndash82 Published Online Ahead of Print16 December 2010 wwwsciencemagorgcontent3316014176ltviewed 29 July 2011gt See also Scott Cleland ldquoGooglersquos PiracyLiabilityrdquo Forbes 9 November 2011 wwwforbescomsitesscottcle-land20111109googles-piracy-liability ltviewed 11 January 2013gt

28 ldquoGovernment of Canada Web Archiverdquo Library and Archives Canada 17October 2007 wwwcollectionscanadagccawebarchivesindex-ehtmlltviewed 11 January 2013gt

29 For background see Bill Curry ldquoVisiting Library and Archives inOttawa Not Without an Appointmentrdquo Globe and Mail 1 May 2012wwwtheglobeandmailcomnewspoliticsottawa-notebookvisiting-library-and-archives-in-ottawa-not-without-an-appointmentarticle2418960 ltviewed 22 May 2012gt and the announcement ldquoLAC Begins Implementation of New Approach to service Deliveryrdquo Library and Archives Canada February 2012wwwcollectionscanadagccawhats-new013-560-ehtml ltviewed 22May 2012gt For professional responses see the ldquoCHArsquos comment onNew Directions for Library and Archives Canadardquo CHA Website March2012 wwwcha-shccaenNews_39items19html ltviewed 19December 2012gt and ldquoSave Library amp Archives Canadardquo CanadianAssociation of University Teachers launched AprilMay 2012 wwwsavelibraryarchivesca ltviewed 19 December 2012gt

30 Ian Milligan ldquoThe Smokescreen of lsquoModernizationrsquo at Library andArchives Canadardquo ActiveHistoryca 22 May 2012wwwactivehistoryca201205the-smokescreen-of-modernization-at-library-and-archives-canada ltviewed 19 December 2012gt

31 ldquoSave LAC May 2012 Campaign Updaterdquo SaveLibraryArchivesca May2012 wwwsavelibraryarchivescaupdate-2012-05aspx ltviewed 19December 2012gt

32 The challenges and opportunities presented by such projects is extensively discussed in Daniel J Cohen ldquoThe Future of Preserving the Pastrdquo CRM The Journal of Heritage Stewardship 2 no 2

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

(Summer 2005) 6ndash19 also available freely online at the RoyRosenzweig Center for History and New Mediawwwchnmgmueduessays-on-history-new-mediaessaysessayid=39For more see ldquoSeptember 11 Digital Archive Saving the Histories of September 11 2001rdquo Centre for History and New Media andAmerican Social History Project Center for Media and Learningwww911digitalarchiveorg ltviewed 11 January 2013gt ldquoHurricaneDigital Memory Bank Collecting and Preserving the Stories of Katrina and Ritardquo Center for History and New Media wwwhurricanearchiveorg ltviewed 11 January 2013gt and ldquoOccupyArchive Archiving the Occupy Movements from 2011rdquo Centre forHistory and New Media wwwoccupyarchiveorg ltviewed 11 January2013gt

33 Daniel Cohen and Roy Rosenzweig Digital History A Guide to Gathering Preserving and Presenting the Past on the Web(Philadelphia Temple University Press 2005) chapter on ldquoPreservingDigital Historyrdquo The chapter is available fully online atwwwchnmgmuedudigitalhistorypreserving5php ltviewed 19December 2012gt

34 For an overview see ldquoPreserving Our Digital Heritage The NationalDigital Information Infrastructure and Preservation Program 2010 Report a Collaborative Initiative of the Library of CongressrdquoDigitalpreservationgov wwwdigitalpreservationgovmultimediadocumentsNDIIPP2010Report_Postpdf ltviewed 19 December 2012gt

35 Steve Meloan ldquoNo Way to Run a Culturerdquo Wired 13 February 1998available online via the Internet Archivewwwwebarchiveorgweb20000619001705httpwwwwiredcomnewsculture012841030100html ltviewed 19 December 2012gt

36 Some of this has been discussed in greater depth on my website See theseries beginning with Ian Milligan ldquoWARC Files A Challenge forHistorians and Finding Needles in Haystacksrdquo ianmilliganca 12December 2012 wwwianmilliganca20121212warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks ltviewed 20 December2012gt

37 Unstructured data as opposed to structured data is important for histo-rians but brings additional challenges Some projects have producedstructured data novels marked up using XML encoding for example soone can immediately see place names dates character names and soforth This makes it much easier for a computer to read and understand

38 See William Noblett ldquoDigitization A Cautionary Talerdquo New Review ofAcademic Librarianship 17 (2011) 3 Tobias Blanke Michael Bryantand Mark Hedges ldquoOcropodium Open Source OCR for Small-ScaleHistorical Archivesrdquo Journal of Information Science 38 no 1 (2012) 83

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

and Donald S Macqueen ldquoDeveloping Methods for Very-Large-ScaleSearches in Proquest Historical Newspapers Collections and InfotracThe Times Digital Archive The Case of Two Million versus TwoMillionsrdquo Journal of English Linguistics 32 no 2 (June 2004) 127

39 Elizabeth Krug ldquoCanadarsquos Digital Collections Sharing the CanadianIdentity on the Internetrdquo The Archivist [Library and Archives Canadapublication] April 2000 wwwcollectionscanadagccapublicationsarchivistmagazine015002-2170-ehtml ltviewed 11 January 2013gt

40 ldquoWayBackMachine Beta wwwcollectionsicgccardquo Internet ArchiveAvailable onlinewwwwaybackarchiveorgwebhttpcollectionsicgcca ltviewed 11January 2013gt

41 ldquoStreet Youth Go Onlinerdquo News Release from Industry Canada InternetArchive 15 November 1996 Available onlinewwwwebarchiveorgweb20001011131150httpinfoicgccacmbWelcomeicnsfffc979db07de58e6852564e400603639514a8d746a34803385256612004d90b3OpenDocument ltviewed 11 January 2013gt

42 Darlene Fichter ldquoProjects for Libraries and Archivesrdquo Canadarsquos DigitalCollection Webpage 24 November 2000 through Internet Archivewwwwebarchiveorgweb200012091312httpcollectionsicgccaEcdc_projhtm ltviewed 18 June 2012gt

43 ldquoCanadarsquos Digital Collections homepagerdquo 21 November 2003 through Internet Archive wwwwebarchiveorgweb20031121213939httpcollectionsicgccaEindexphpvo=15 ltviewed 18 June 2012gt

44 ldquoImportant news about Canadarsquos Digital Collectionsrdquo Canadarsquos DigitalCollections Webpage 18 November 2006gt through Internet Archivewwwwebarchiveorgweb20061118125255httpcollectionsicgccaEviewhtml ltviewed 18 June 2012gt

45 For more information see the information at ldquoWolfram Mathematica9rdquo Wolfram Research wwwwolframcommathematica ltviewed 11January 2013gt An example of work in Mathematica can be seen inWilliam J Turkelrsquos demonstration of ldquoTerm Weighting with TF-IDFrdquodrawing on Old Bailey proceedings It is available at wwwdemonstrationswolframcomTermWeightingWithTFIDFltviewed 11 January 2013gt

46 It was 8 degrees Celsius at 500 pm on 18 December 1983 ldquoEdithrdquowas more popular than Emma until around 1910 and now ldquoEmmardquo isthe third most popular name in the United States for new births in2010 the Canadian unemployment rate in March 1985 was 106 per-cent 25th highest in the world and the Canadian GDP in December1995 was $6039 billion per year 10th largest in the world

47 It was formerly known as ldquoVoyeur Toolsrdquo before a 2011 name change Itis available online at wwwvoyant-toolsorg

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

48 For more information see ldquoSoftware Environment for the Advancementof Scholarly Research (SEASR)rdquo wwwseasrorg ltviewed 18 June2012gt

49 The MALLET toolkit can also be run independently of SEASR For more on topic modeling specifically see ldquoTopic Modelingrdquo MALLET Machine Learning for Language Toolkitwwwmalletcsumassedutopicsphp ltviewed 18 June 2012gt

50 For a good overview see ldquoAnalytics Dunningrsquos Log-Likelihood StatisticrdquoMONK Metadata Offer New Knowledge wwwgautamlisillinoisedumonkmiddlewarepublicanalyticsdunningshtml ltviewed 18 June2012gt

51 ldquoImportant Noticesrdquo Library and Archives Website last updated 30 July2010 wwwcollectionscanadagccanoticesindex-ehtml ltviewed 18June 2012gt Historians will have to increasingly learn to navigate theTerms of Services (TOS) of born-digital source repositories They can beusually found on index or splash pages under headings as varied asldquoTerms of Servicerdquo ldquoPermissionsrdquo ldquoImportant Noticesrdquo or others

52 The command was wget mdashm mdash-cookies=on mdash-keep-session-cookies mdash-load-cookies=cookietxt mdash-referrer=httpepelac-bacgcca100205301iccdccensusdefaulthtm mdashrandom-wait mdashwait=1 mdashlimit-rate=30K httpepelac-bacgcca100205301iccdcEAlphabetasp This loadeda previous cookie load as there is a ldquosplash redirectrdquo page warning aboutdysfunctional archived content It also imposed a random wait andbandwidth cap to be a good netizen

53 For a discussion of this see Christopher D Manning PrabhakarRaghavan Hinrich Scuumltze An Introduction to Information Retrieval(Cambridge UK Cambridge University Press 2009) 27ndash8 Availableonline at wwwnlpstanfordeduIR-book

54 David M Blei ldquoIntroduction to Probabilistic Topic Modelsrdquowwwcsprincetonedu~bleipapersBlei2011pdf ltviewed 18 June2012gt

55 The research was actually carried out on a rather dated late 2007MacBook Pro

56 See John Bonnett ldquoHigh-Performance Computing An Agenda for theSocial Sciences and the Humanities in Canadardquo Digital Studies 1 no 2(2009) wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt and John Bonnett Geoffrey Rockwell and KyleKuchmey ldquoHigh Performance Computing in the arts and HumanitiesrdquoSHARCNET Shared Hierarchical Academic Research Computing Network2006 wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt

57 ldquoSHARCNET History of SHARCNETrdquo SHARCNET 2009

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

wwwsharcnetcamyabouthistory ltviewed 18 December 2012gt 58 ldquoPriority Areasrdquo SSHRC last modified 4 November 2011

wwwsshrc-crshgccafunding-financementprograms-programmespriority_areas-domaines_prioritairesindex-engaspx ltviewed 18 June2012gt

59 Eloquently discussed in Adam Crymble ldquoIs Digital Humanities a Fieldof Researchrdquo Thoughts on Digital amp Public History 11 August 2011wwwadamcrymbleblogspotcom201108is-digital-humanities-field-of-researchhtml ltviewed 18 June 2012gt

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 42: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

(Summer 2005) 6ndash19 also available freely online at the RoyRosenzweig Center for History and New Mediawwwchnmgmueduessays-on-history-new-mediaessaysessayid=39For more see ldquoSeptember 11 Digital Archive Saving the Histories of September 11 2001rdquo Centre for History and New Media andAmerican Social History Project Center for Media and Learningwww911digitalarchiveorg ltviewed 11 January 2013gt ldquoHurricaneDigital Memory Bank Collecting and Preserving the Stories of Katrina and Ritardquo Center for History and New Media wwwhurricanearchiveorg ltviewed 11 January 2013gt and ldquoOccupyArchive Archiving the Occupy Movements from 2011rdquo Centre forHistory and New Media wwwoccupyarchiveorg ltviewed 11 January2013gt

33 Daniel Cohen and Roy Rosenzweig Digital History A Guide to Gathering Preserving and Presenting the Past on the Web(Philadelphia Temple University Press 2005) chapter on ldquoPreservingDigital Historyrdquo The chapter is available fully online atwwwchnmgmuedudigitalhistorypreserving5php ltviewed 19December 2012gt

34 For an overview see ldquoPreserving Our Digital Heritage The NationalDigital Information Infrastructure and Preservation Program 2010 Report a Collaborative Initiative of the Library of CongressrdquoDigitalpreservationgov wwwdigitalpreservationgovmultimediadocumentsNDIIPP2010Report_Postpdf ltviewed 19 December 2012gt

35 Steve Meloan ldquoNo Way to Run a Culturerdquo Wired 13 February 1998available online via the Internet Archivewwwwebarchiveorgweb20000619001705httpwwwwiredcomnewsculture012841030100html ltviewed 19 December 2012gt

36 Some of this has been discussed in greater depth on my website See theseries beginning with Ian Milligan ldquoWARC Files A Challenge forHistorians and Finding Needles in Haystacksrdquo ianmilliganca 12December 2012 wwwianmilliganca20121212warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks ltviewed 20 December2012gt

37 Unstructured data as opposed to structured data is important for histo-rians but brings additional challenges Some projects have producedstructured data novels marked up using XML encoding for example soone can immediately see place names dates character names and soforth This makes it much easier for a computer to read and understand

38 See William Noblett ldquoDigitization A Cautionary Talerdquo New Review ofAcademic Librarianship 17 (2011) 3 Tobias Blanke Michael Bryantand Mark Hedges ldquoOcropodium Open Source OCR for Small-ScaleHistorical Archivesrdquo Journal of Information Science 38 no 1 (2012) 83

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

and Donald S Macqueen ldquoDeveloping Methods for Very-Large-ScaleSearches in Proquest Historical Newspapers Collections and InfotracThe Times Digital Archive The Case of Two Million versus TwoMillionsrdquo Journal of English Linguistics 32 no 2 (June 2004) 127

39 Elizabeth Krug ldquoCanadarsquos Digital Collections Sharing the CanadianIdentity on the Internetrdquo The Archivist [Library and Archives Canadapublication] April 2000 wwwcollectionscanadagccapublicationsarchivistmagazine015002-2170-ehtml ltviewed 11 January 2013gt

40 ldquoWayBackMachine Beta wwwcollectionsicgccardquo Internet ArchiveAvailable onlinewwwwaybackarchiveorgwebhttpcollectionsicgcca ltviewed 11January 2013gt

41 ldquoStreet Youth Go Onlinerdquo News Release from Industry Canada InternetArchive 15 November 1996 Available onlinewwwwebarchiveorgweb20001011131150httpinfoicgccacmbWelcomeicnsfffc979db07de58e6852564e400603639514a8d746a34803385256612004d90b3OpenDocument ltviewed 11 January 2013gt

42 Darlene Fichter ldquoProjects for Libraries and Archivesrdquo Canadarsquos DigitalCollection Webpage 24 November 2000 through Internet Archivewwwwebarchiveorgweb200012091312httpcollectionsicgccaEcdc_projhtm ltviewed 18 June 2012gt

43 ldquoCanadarsquos Digital Collections homepagerdquo 21 November 2003 through Internet Archive wwwwebarchiveorgweb20031121213939httpcollectionsicgccaEindexphpvo=15 ltviewed 18 June 2012gt

44 ldquoImportant news about Canadarsquos Digital Collectionsrdquo Canadarsquos DigitalCollections Webpage 18 November 2006gt through Internet Archivewwwwebarchiveorgweb20061118125255httpcollectionsicgccaEviewhtml ltviewed 18 June 2012gt

45 For more information see the information at ldquoWolfram Mathematica9rdquo Wolfram Research wwwwolframcommathematica ltviewed 11January 2013gt An example of work in Mathematica can be seen inWilliam J Turkelrsquos demonstration of ldquoTerm Weighting with TF-IDFrdquodrawing on Old Bailey proceedings It is available at wwwdemonstrationswolframcomTermWeightingWithTFIDFltviewed 11 January 2013gt

46 It was 8 degrees Celsius at 500 pm on 18 December 1983 ldquoEdithrdquowas more popular than Emma until around 1910 and now ldquoEmmardquo isthe third most popular name in the United States for new births in2010 the Canadian unemployment rate in March 1985 was 106 per-cent 25th highest in the world and the Canadian GDP in December1995 was $6039 billion per year 10th largest in the world

47 It was formerly known as ldquoVoyeur Toolsrdquo before a 2011 name change Itis available online at wwwvoyant-toolsorg

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

48 For more information see ldquoSoftware Environment for the Advancementof Scholarly Research (SEASR)rdquo wwwseasrorg ltviewed 18 June2012gt

49 The MALLET toolkit can also be run independently of SEASR For more on topic modeling specifically see ldquoTopic Modelingrdquo MALLET Machine Learning for Language Toolkitwwwmalletcsumassedutopicsphp ltviewed 18 June 2012gt

50 For a good overview see ldquoAnalytics Dunningrsquos Log-Likelihood StatisticrdquoMONK Metadata Offer New Knowledge wwwgautamlisillinoisedumonkmiddlewarepublicanalyticsdunningshtml ltviewed 18 June2012gt

51 ldquoImportant Noticesrdquo Library and Archives Website last updated 30 July2010 wwwcollectionscanadagccanoticesindex-ehtml ltviewed 18June 2012gt Historians will have to increasingly learn to navigate theTerms of Services (TOS) of born-digital source repositories They can beusually found on index or splash pages under headings as varied asldquoTerms of Servicerdquo ldquoPermissionsrdquo ldquoImportant Noticesrdquo or others

52 The command was wget mdashm mdash-cookies=on mdash-keep-session-cookies mdash-load-cookies=cookietxt mdash-referrer=httpepelac-bacgcca100205301iccdccensusdefaulthtm mdashrandom-wait mdashwait=1 mdashlimit-rate=30K httpepelac-bacgcca100205301iccdcEAlphabetasp This loadeda previous cookie load as there is a ldquosplash redirectrdquo page warning aboutdysfunctional archived content It also imposed a random wait andbandwidth cap to be a good netizen

53 For a discussion of this see Christopher D Manning PrabhakarRaghavan Hinrich Scuumltze An Introduction to Information Retrieval(Cambridge UK Cambridge University Press 2009) 27ndash8 Availableonline at wwwnlpstanfordeduIR-book

54 David M Blei ldquoIntroduction to Probabilistic Topic Modelsrdquowwwcsprincetonedu~bleipapersBlei2011pdf ltviewed 18 June2012gt

55 The research was actually carried out on a rather dated late 2007MacBook Pro

56 See John Bonnett ldquoHigh-Performance Computing An Agenda for theSocial Sciences and the Humanities in Canadardquo Digital Studies 1 no 2(2009) wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt and John Bonnett Geoffrey Rockwell and KyleKuchmey ldquoHigh Performance Computing in the arts and HumanitiesrdquoSHARCNET Shared Hierarchical Academic Research Computing Network2006 wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt

57 ldquoSHARCNET History of SHARCNETrdquo SHARCNET 2009

MINING THE lsquoINTERNET GRAVEYARDrsquo RETHINKING THE HISTORIANSrsquo TOOLKIT

wwwsharcnetcamyabouthistory ltviewed 18 December 2012gt 58 ldquoPriority Areasrdquo SSHRC last modified 4 November 2011

wwwsshrc-crshgccafunding-financementprograms-programmespriority_areas-domaines_prioritairesindex-engaspx ltviewed 18 June2012gt

59 Eloquently discussed in Adam Crymble ldquoIs Digital Humanities a Fieldof Researchrdquo Thoughts on Digital amp Public History 11 August 2011wwwadamcrymbleblogspotcom201108is-digital-humanities-field-of-researchhtml ltviewed 18 June 2012gt

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

Page 43: Mining the ‘Internet Graveyard’: Rethinking the …...en histoire et dans le développement professionnel des historiens. The information produced and consumed by humankind used

and Donald S Macqueen ldquoDeveloping Methods for Very-Large-ScaleSearches in Proquest Historical Newspapers Collections and InfotracThe Times Digital Archive The Case of Two Million versus TwoMillionsrdquo Journal of English Linguistics 32 no 2 (June 2004) 127

39 Elizabeth Krug ldquoCanadarsquos Digital Collections Sharing the CanadianIdentity on the Internetrdquo The Archivist [Library and Archives Canadapublication] April 2000 wwwcollectionscanadagccapublicationsarchivistmagazine015002-2170-ehtml ltviewed 11 January 2013gt

40 ldquoWayBackMachine Beta wwwcollectionsicgccardquo Internet ArchiveAvailable onlinewwwwaybackarchiveorgwebhttpcollectionsicgcca ltviewed 11January 2013gt

41 ldquoStreet Youth Go Onlinerdquo News Release from Industry Canada InternetArchive 15 November 1996 Available onlinewwwwebarchiveorgweb20001011131150httpinfoicgccacmbWelcomeicnsfffc979db07de58e6852564e400603639514a8d746a34803385256612004d90b3OpenDocument ltviewed 11 January 2013gt

42 Darlene Fichter ldquoProjects for Libraries and Archivesrdquo Canadarsquos DigitalCollection Webpage 24 November 2000 through Internet Archivewwwwebarchiveorgweb200012091312httpcollectionsicgccaEcdc_projhtm ltviewed 18 June 2012gt

43 ldquoCanadarsquos Digital Collections homepagerdquo 21 November 2003 through Internet Archive wwwwebarchiveorgweb20031121213939httpcollectionsicgccaEindexphpvo=15 ltviewed 18 June 2012gt

44 ldquoImportant news about Canadarsquos Digital Collectionsrdquo Canadarsquos DigitalCollections Webpage 18 November 2006gt through Internet Archivewwwwebarchiveorgweb20061118125255httpcollectionsicgccaEviewhtml ltviewed 18 June 2012gt

45 For more information see the information at ldquoWolfram Mathematica9rdquo Wolfram Research wwwwolframcommathematica ltviewed 11January 2013gt An example of work in Mathematica can be seen inWilliam J Turkelrsquos demonstration of ldquoTerm Weighting with TF-IDFrdquodrawing on Old Bailey proceedings It is available at wwwdemonstrationswolframcomTermWeightingWithTFIDFltviewed 11 January 2013gt

46 It was 8 degrees Celsius at 500 pm on 18 December 1983 ldquoEdithrdquowas more popular than Emma until around 1910 and now ldquoEmmardquo isthe third most popular name in the United States for new births in2010 the Canadian unemployment rate in March 1985 was 106 per-cent 25th highest in the world and the Canadian GDP in December1995 was $6039 billion per year 10th largest in the world

47 It was formerly known as ldquoVoyeur Toolsrdquo before a 2011 name change Itis available online at wwwvoyant-toolsorg

JOURNAL OF THE CHA 2012 REVUE DE LA SHC

48 For more information see ldquoSoftware Environment for the Advancementof Scholarly Research (SEASR)rdquo wwwseasrorg ltviewed 18 June2012gt

49 The MALLET toolkit can also be run independently of SEASR For more on topic modeling specifically see ldquoTopic Modelingrdquo MALLET Machine Learning for Language Toolkitwwwmalletcsumassedutopicsphp ltviewed 18 June 2012gt

50 For a good overview see ldquoAnalytics Dunningrsquos Log-Likelihood StatisticrdquoMONK Metadata Offer New Knowledge wwwgautamlisillinoisedumonkmiddlewarepublicanalyticsdunningshtml ltviewed 18 June2012gt

51 ldquoImportant Noticesrdquo Library and Archives Website last updated 30 July2010 wwwcollectionscanadagccanoticesindex-ehtml ltviewed 18June 2012gt Historians will have to increasingly learn to navigate theTerms of Services (TOS) of born-digital source repositories They can beusually found on index or splash pages under headings as varied asldquoTerms of Servicerdquo ldquoPermissionsrdquo ldquoImportant Noticesrdquo or others

52 The command was wget mdashm mdash-cookies=on mdash-keep-session-cookies mdash-load-cookies=cookietxt mdash-referrer=httpepelac-bacgcca100205301iccdccensusdefaulthtm mdashrandom-wait mdashwait=1 mdashlimit-rate=30K httpepelac-bacgcca100205301iccdcEAlphabetasp This loadeda previous cookie load as there is a ldquosplash redirectrdquo page warning aboutdysfunctional archived content It also imposed a random wait andbandwidth cap to be a good netizen

53 For a discussion of this see Christopher D Manning PrabhakarRaghavan Hinrich Scuumltze An Introduction to Information Retrieval(Cambridge UK Cambridge University Press 2009) 27ndash8 Availableonline at wwwnlpstanfordeduIR-book

54 David M Blei ldquoIntroduction to Probabilistic Topic Modelsrdquowwwcsprincetonedu~bleipapersBlei2011pdf ltviewed 18 June2012gt

55 The research was actually carried out on a rather dated late 2007MacBook Pro

56 See John Bonnett ldquoHigh-Performance Computing An Agenda for theSocial Sciences and the Humanities in Canadardquo Digital Studies 1 no 2(2009) wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt and John Bonnett Geoffrey Rockwell and KyleKuchmey ldquoHigh Performance Computing in the arts and HumanitiesrdquoSHARCNET Shared Hierarchical Academic Research Computing Network2006 wwwsharcnetcaDocumentsHHPChpcdhhtml ltviewed 18December 2012gt

57 “SHARCNET: History of SHARCNET,” SHARCNET, 2009, www.sharcnet.ca/my/about/history <viewed 18 December 2012>.

58 “Priority Areas,” SSHRC, last modified 4 November 2011, www.sshrc-crsh.gc.ca/funding-financement/programs-programmes/priority_areas-domaines_prioritaires/index-eng.aspx <viewed 18 June 2012>.

59 Eloquently discussed in Adam Crymble, “Is Digital Humanities a Field of Research?” Thoughts on Digital & Public History, 11 August 2011, adamcrymble.blogspot.com/2011/08/is-digital-humanities-field-of-research.html <viewed 18 June 2012>.
