CHAPTER 1

Using neural nets to recognize handwritten digits
By Michael Nielsen / Jan 2017

The human visual system is one of the wonders of the world. Consider the following sequence of handwritten digits:

[image: a sequence of six handwritten digits]

Most people effortlessly recognize those digits as 504192. That ease is deceptive. In each hemisphere of our brain, humans have a primary visual cortex, also known as V1, containing 140 million neurons, with tens of billions of connections between them. And yet human vision involves not just V1, but an entire series of visual cortices (V2, V3, V4, and V5) doing progressively more complex image processing. We carry in our heads a supercomputer, tuned by evolution over hundreds of millions of years, and superbly adapted to understand the visual world. Recognizing handwritten digits isn't easy. Rather, we humans are stupendously, astoundingly good at making sense of what our eyes show us. But nearly all that work is done unconsciously. And so we don't usually appreciate how tough a problem our visual systems solve.

The difficulty of visual pattern recognition becomes apparent if you attempt to write a computer program to recognize digits like those above. What seems easy when we do it ourselves suddenly becomes extremely difficult. Simple intuitions about how we recognize shapes ("a 9 has a loop at the top, and a vertical stroke in the bottom right") turn out to be not so simple to express algorithmically. When you try to make such rules precise, you quickly get lost in a morass of exceptions and caveats and special cases. It seems hopeless.

Neural networks approach the problem in a different way. The idea is to take a large number of handwritten digits, known as training examples,

[image: 100 handwritten training digits]
and then develop a system which can learn from those training examples. In other words, the neural network uses the examples to automatically infer rules for recognizing handwritten digits. Furthermore, by increasing the number of training examples, the network can learn more about handwriting, and so improve its accuracy. So while I've shown just 100 training digits above, perhaps we could build a better handwriting recognizer by using thousands or even millions or billions of training examples.
In this chapter we'll write a computer program implementing a neural network that learns to recognize handwritten digits. The program is just 74 lines long, and uses no special neural network libraries. But this short program can recognize digits with an accuracy over 96 percent, without human intervention. Furthermore, in later chapters we'll develop ideas which can improve accuracy to over 99 percent. In fact, the best commercial neural networks are now so good that they are used by banks to process cheques, and by post offices to recognize addresses.

We're focusing on handwriting recognition because it's an excellent prototype problem for learning about neural networks in general. As a prototype it hits a sweet spot: it's challenging (it's no small feat to recognize handwritten digits), but it's not so difficult as to require an extremely complicated solution, or tremendous computational power. Furthermore, it's a great way to develop more advanced techniques, such as deep learning. And so throughout the book we'll return repeatedly to the problem of handwriting recognition. Later in the book, we'll discuss how these ideas may be applied to other problems in computer vision, and also in speech, natural language processing, and other domains.
Of course, if the point of the chapter was only to write a computer program to recognize handwritten digits, then the chapter would be much shorter! But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent. Throughout, I focus on explaining why things are done the way they are, and on building your neural networks intuition. That requires a lengthier discussion than if I just presented the basic mechanics of what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the payoffs, by the end of the chapter we'll be in position to understand what deep learning is, and why it matters.
Perceptrons
What is a neural network? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons: in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.

So how do perceptrons work? A perceptron takes several binary inputs, $x_1, x_2, \ldots$, and produces a single binary output:

[diagram: a perceptron with three inputs and a single output]
In the example shown the perceptron has three inputs, $x_1, x_2, x_3$. In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, $w_1, w_2, \ldots$, real numbers expressing the importance of the respective inputs to the output. The neuron's output, 0 or 1, is determined by whether the weighted sum $\sum_j w_j x_j$ is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms:

\[
\mbox{output} = \begin{cases}
0 & \mbox{if } \sum_j w_j x_j \leq \mbox{threshold} \\
1 & \mbox{if } \sum_j w_j x_j > \mbox{threshold}
\end{cases} \tag{1}
\]

That's all there is to how a perceptron works!
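In code, the rule is only a few lines. Here's a minimal sketch in Python (my own illustration; the function name perceptron_output is not from the book's program):

def perceptron_output(x, w, threshold):
    """Return the perceptron's output, 0 or 1, per the rule in (1)."""
    weighted_sum = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return 1 if weighted_sum > threshold else 0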
That's the basic mathematical model. A way you can think about the perceptron is that it's a device that makes decisions by weighing up evidence. Let me give an example. It's not a very realistic example, but it's easy to understand, and we'll soon get to more realistic examples. Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors:

1. Is the weather good?
2. Does your boyfriend or girlfriend want to accompany you?
3. Is the festival near public transit? (You don't own a car.)

We can represent these three factors by corresponding binary variables $x_1, x_2$, and $x_3$. For instance, we'd have $x_1 = 1$ if the weather is good, and $x_1 = 0$ if the weather is bad. Similarly, $x_2 = 1$ if your boyfriend or girlfriend wants to go, and $x_2 = 0$ if not. And similarly again for $x_3$ and public transit.

Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But perhaps you really loathe bad weather, and there's no way you'd go to the festival if the weather is bad. You can use perceptrons to model this kind of decision-making. One way to do this is to choose a weight $w_1 = 6$ for the weather, and $w_2 = 2$ and $w_3 = 2$ for the other conditions. The larger value of $w_1$ indicates that the weather matters a lot to you, much more than whether your boyfriend or girlfriend joins you, or the nearness of public transit. Finally, suppose you choose a threshold of 5 for the perceptron. With these choices, the perceptron implements the desired decision-making model, outputting 1 whenever the weather is good, and 0 whenever the weather is bad. It makes no difference to the output whether your boyfriend or girlfriend wants to go, or whether public transit is nearby.
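With the perceptron_output sketch from above, we can check this model directly (again, my illustration, not code from the book):

# good weather, partner uninterested, transit far: 6*1 + 2*0 + 2*0 = 6 > 5
print(perceptron_output([1, 0, 0], [6, 2, 2], 5))  # 1: go to the festival
# bad weather, partner keen, transit near: 6*0 + 2*1 + 2*1 = 4 <= 5
print(perceptron_output([0, 1, 1], [6, 2, 2], 5))  # 0: stay home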
By varying the weights and the threshold, we can get different models of decision-making. For example, suppose we instead chose a threshold of 3. Then the perceptron would decide that you should go to the festival whenever the weather was good or when both the festival was near public transit and your boyfriend or girlfriend was willing to join you. In other words, it'd be a different model of decision-making. Dropping the threshold means you're more willing to go to the festival.
Obviously, the perceptron isn't a complete model of human decision-making! But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of perceptrons could make quite subtle decisions:

[diagram: a three-layer network of perceptrons]

In this network, the first column of perceptrons (what we'll call the first layer of perceptrons) is making three very simple decisions, by weighing the input evidence. What about the perceptrons in the second layer? Each of those perceptrons is making a decision by weighing up the results from the first layer of decision-making. In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer. And even more complex decisions can be made by the perceptron in the third layer. In this way, a many-layer network of perceptrons can engage in sophisticated decision-making.

Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It's less unwieldy than drawing a single output line which then splits.
Let's simplify the way we describe perceptrons. The condition $\sum_j w_j x_j > \mbox{threshold}$ is cumbersome, and we can make two notational changes to simplify it. The first change is to write $\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, where $w$ and $x$ are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's bias, $b \equiv -\mbox{threshold}$. Using the bias instead of the threshold, the perceptron rule can be rewritten:

\[
\mbox{output} = \begin{cases}
0 & \mbox{if } w \cdot x + b \leq 0 \\
1 & \mbox{if } w \cdot x + b > 0
\end{cases} \tag{2}
\]

You can think of the bias as a measure of how easy it is to get the perceptron to output a 1. Or to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire. For a perceptron with a really big bias, it's extremely easy for the perceptron to output a 1. But if the bias is very negative, then it's difficult for the perceptron to output a 1. Obviously, introducing the bias is only a small change in how we describe perceptrons, but we'll see later that it leads to further notational simplifications. Because of this, in the remainder of the book we won't use the threshold, we'll always use the bias.
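In the bias form, the earlier sketch becomes (a hypothetical helper of mine, mirroring Equation (2)):

def perceptron_output_bias(x, w, b):
    """Perceptron rule in bias form: output 1 if w . x + b > 0, else 0."""
    return 1 if sum(w_j * x_j for w_j, x_j in zip(w, x)) + b > 0 else 0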
I've described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight $-2$, and an overall bias of 3. Here's our perceptron:

[diagram: a two-input perceptron with weights -2, -2 and bias 3]

Then we see that input 00 produces output 1, since $(-2) \times 0 + (-2) \times 0 + 3 = 3$ is positive. Here, I've introduced the $\times$ symbol to make the multiplications explicit. Similar calculations show that the inputs 01 and 10 produce output 1. But the input 11 produces output 0, since $(-2) \times 1 + (-2) \times 1 + 3 = -1$ is negative. And so our perceptron implements a NAND gate!
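We can verify the whole truth table with the bias-form sketch from above (my illustration):

def nand_perceptron(x1, x2):
    # two inputs, each with weight -2, overall bias 3
    return perceptron_output_bias([x1, x2], [-2, -2], 3)

for x1 in (0, 1):
    for x2 in (0, 1):
        print("{0} {1} -> {2}".format(x1, x2, nand_perceptron(x1, x2)))
# 0 0 -> 1,  0 1 -> 1,  1 0 -> 1,  1 1 -> 0: a NAND gate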
The NAND example shows that we can use perceptrons to compute simple logical functions. In fact, we can use networks of perceptrons to compute any logical function at all. The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits, $x_1$ and $x_2$. This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to 1 when both $x_1$ and $x_2$ are 1, i.e., the carry bit is just the bitwise product $x_1 x_2$:

[diagram: a NAND-gate circuit adding two bits]
To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of 3. Here's the resulting network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram:

[diagram: the adder circuit with each NAND gate replaced by a perceptron]

One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron. When I defined the perceptron model I didn't say whether this kind of double-output-to-the-same-place was allowed. Actually, it doesn't much matter. If we don't want to allow this kind of thing, then it's possible to simply merge the two lines, into a single connection with a weight of $-4$ instead of two connections with $-2$ weights. (If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to $-2$, all biases equal to 3, and a single weight of $-4$, as marked:

[diagram: the same network with the doubled connection merged into a single weight of -4]

Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables floating to the left of the network of perceptrons. In fact, it's conventional to draw an extra layer of perceptrons, the input layer, to encode the inputs:

[diagram: the network with an explicit input layer]
This notation for input perceptrons, in which we have an output, but no inputs, is a shorthand. It doesn't actually mean a perceptron with no inputs. To see this, suppose we did have a perceptron with no inputs. Then the weighted sum $\sum_j w_j x_j$ would always be zero, and so the perceptron would output 1 if $b > 0$, and 0 if $b \leq 0$. That is, the perceptron would simply output a fixed value, not the desired value ($x_1$, in the example above). It's better to think of the input perceptrons as not really being perceptrons at all, but rather special units which are simply defined to output the desired values, $x_1, x_2, \ldots$.
The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.

The computational universality of perceptrons is simultaneously reassuring and disappointing. It's reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it's also disappointing, because it makes it seem as though perceptrons are merely a new type of NAND gate. That's hardly big news!

However, the situation is better than this view suggests. It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer. These learning algorithms enable us to use artificial neurons in a way which is radically different to conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.
Sigmoid neurons

Learning algorithms sound terrific. But how can we devise such algorithms for a neural network? Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!):

[diagram: a small change in any weight causes a small change in the output]
If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an "8" when it should be a "9". We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a "9". And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.

The problem is that this isn't what happens when our network contains perceptrons. In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from 0 to 1. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way. So while your "9" might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour. Perhaps there's some clever way of getting around this problem. But it's not immediately obvious how we can get a network of perceptrons to learn.
We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.

Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons:

[diagram: a sigmoid neuron, drawn like a perceptron]
Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \ldots$. But instead of being just 0 or 1, these inputs can also take on any values between 0 and 1. So, for instance, 0.638 is a valid input for a sigmoid neuron. Also just like a perceptron, the sigmoid neuron has weights for each input, $w_1, w_2, \ldots$, and an overall bias, $b$. But the output is not 0 or 1. Instead, it's $\sigma(w \cdot x + b)$, where $\sigma$ is called the sigmoid function*, and is defined by:

*Incidentally, $\sigma$ is sometimes called the logistic function, and this new class of neurons called logistic neurons. It's useful to remember this terminology, since these terms are used by many people working with neural nets. However, we'll stick with the sigmoid terminology.

\[
\sigma(z) \equiv \frac{1}{1 + e^{-z}}. \tag{3}
\]

To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1, x_2, \ldots$, weights $w_1, w_2, \ldots$, and bias $b$ is

\[
\frac{1}{1 + \exp(-\sum_j w_j x_j - b)}. \tag{4}
\]
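In NumPy, Equations (3) and (4) take only a few lines. Here's a sketch (mine; the book's program defines an essentially identical sigmoid function later in this chapter):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron_output(x, w, b):
    """Output of a sigmoid neuron: sigma(w . x + b), as in Equation (4)."""
    return sigmoid(np.dot(w, x) + b)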
At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it. In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.

To understand the similarity to the perceptron model, suppose $z \equiv w \cdot x + b$ is a large positive number. Then $e^{-z} \approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w \cdot x + b$ is large and positive, the output from the sigmoid neuron is approximately 1, just as it would have been for a perceptron. Suppose on the other hand that $z = w \cdot x + b$ is very negative. Then $e^{-z} \to \infty$, and $\sigma(z) \approx 0$. So when $z = w \cdot x + b$ is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron. It's only when $w \cdot x + b$ is of modest size that there's much deviation from the perceptron model.
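You can check these limiting cases numerically with the sigmoid function defined in the sketch above:

print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# roughly [0.00005, 0.269, 0.5, 0.731, 0.99995]: close to 0 for very
# negative z, close to 1 for large positive z, in between near z = 0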
What about the algebraic form of $\sigma$? How can we understand that? In fact, the exact form of $\sigma$ isn't so important: what really matters is the shape of the function when plotted. Here's the shape:

[plot: the sigmoid function, an S-shaped curve rising smoothly from 0 to 1 as z runs from -4 to 4]

This shape is a smoothed out version of a step function:

[plot: the step function, jumping from 0 to 1 at z = 0]

If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be 1 or 0 depending on whether $w \cdot x + b$ was positive or negative*. By using the actual $\sigma$ function we get, as already implied above, a smoothed out perceptron. Indeed, it's the smoothness of the $\sigma$ function that is the crucial fact, not its detailed form.

*Actually, when $w \cdot x + b = 0$ the perceptron outputs 0, while the step function outputs 1. So, strictly speaking, we'd need to modify the step function at that one point. But you get the idea.
The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \mbox{output}$ in the output from the neuron. In fact, calculus tells us that $\Delta \mbox{output}$ is well approximated by

\[
\Delta \mbox{output} \approx \sum_j \frac{\partial\, \mbox{output}}{\partial w_j} \Delta w_j + \frac{\partial\, \mbox{output}}{\partial b} \Delta b, \tag{5}
\]

where the sum is over all the weights, $w_j$, and $\partial\, \mbox{output} / \partial w_j$ and $\partial\, \mbox{output} / \partial b$ denote partial derivatives of the output with respect to $w_j$ and $b$, respectively. Don't panic if you're not comfortable with partial derivatives! While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \mbox{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.
If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in Equation (3)? In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$. The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change. It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated. In any case, $\sigma$ is commonly used in work on neural nets, and is the activation function we'll use most often in this book.
How should we interpret the output from a sigmoid neuron? Obviously, one big difference between perceptrons and sigmoid neurons is that sigmoid neurons don't just output 0 or 1. They can have as output any real number between 0 and 1, so values such as 0.173 and 0.689 are legitimate outputs. This can be useful, for example, if we want to use the output value to represent the average intensity of the pixels in an image input to a neural network. But sometimes it can be a nuisance. Suppose we want the output from the network to indicate either "the input image is a 9" or "the input image is not a 9". Obviously, it'd be easiest to do this if the output was a 0 or a 1, as in a perceptron. But in practice we can set up a convention to deal with this, for example, by deciding to interpret any output of at least 0.5 as indicating a "9", and any output less than 0.5 as indicating "not a 9". I'll always explicitly state when we're using such a convention, so it shouldn't cause any confusion.
Exercises

Sigmoid neurons simulating perceptrons, part I
Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant, $c > 0$. Show that the behaviour of the network doesn't change.

Sigmoid neurons simulating perceptrons, part II
Suppose we have the same setup as the last problem: a network of perceptrons. Suppose also that the overall input to the network of perceptrons has been chosen. We won't need the actual input value, we just need the input to have been fixed. Suppose the weights and biases are such that $w \cdot x + b \neq 0$ for the input $x$ to any particular perceptron in the network. Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant $c > 0$. Show that in the limit as $c \to \infty$ the behaviour of this network of sigmoid neurons is exactly the same as the network of perceptrons. How can this fail when $w \cdot x + b = 0$ for one of the perceptrons?
The architecture of neural networks

In the next section I'll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network:

[diagram: a network with an input layer, one hidden layer, and a single output neuron]

As mentioned earlier, the leftmost layer in this network is called the input layer, and the neurons within the layer are called input neurons. The rightmost or output layer contains the output neurons, or, as in this case, a single output neuron. The middle layer is called a hidden layer, since the neurons in this layer are neither inputs nor outputs. The term "hidden" perhaps sounds a little mysterious (the first time I heard the term I thought it must have some deep philosophical or mathematical significance) but it really means nothing more than "not an input or an output". The network above has just a single hidden layer, but some networks have multiple hidden layers. For example, the following four-layer network has two hidden layers:

[diagram: a four-layer network with two hidden layers]

Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.
The design of the input and output layers in a network is often straightforward. For example, suppose we're trying to determine whether a handwritten image depicts a "9" or not. A natural way to design the network is to encode the intensities of the image pixels into the input neurons. If the image is a 64 by 64 greyscale image, then we'd have $4{,}096 = 64 \times 64$ input neurons, with the intensities scaled appropriately between 0 and 1. The output layer will contain just a single neuron, with output values of less than 0.5 indicating "input image is not a 9", and values greater than 0.5 indicating "input image is a 9".

While the design of the input and output layers of a neural network is often straightforward, there can be quite an art to the design of the hidden layers. In particular, it's not possible to sum up the design process for the hidden layers with a few simple rules of thumb. Instead, neural networks researchers have developed many design heuristics for the hidden layers, which help people get the behaviour they want out of their nets. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the network. We'll meet several such design heuristics later in this book.
Up to now, we've been discussing neural networks where the output from one layer is used as input to the next layer. Such networks are called feedforward neural networks. This means there are no loops in the network: information is always fed forward, never fed back. If we did have loops, we'd end up with situations where the input to the $\sigma$ function depended on the output. That'd be hard to make sense of, and so we don't allow such loops.

However, there are other models of artificial neural networks in which feedback loops are possible. These models are called recurrent neural networks. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent. That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously.

Recurrent neural nets have been less influential than feedforward networks, in part because the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent networks are still extremely interesting. They're much closer in spirit to how our brains work than feedforward networks. And it's possible that recurrent networks can solve important problems which can only be solved with great difficulty by feedforward networks. However, to limit our scope, in this book we're going to concentrate on the more widely used feedforward networks.
A simple network to classify handwritten digits

Having defined neural networks, let's return to handwriting recognition. We can split the problem of recognizing handwritten digits into two sub-problems. First, we'd like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we'd like to break the image

[image: the handwritten digits 504192]

into six separate images,

[image: the same six digits, each in its own box]

We humans solve this segmentation problem with ease, but it's challenging for a computer program to correctly break up the image. Once the image has been segmented, the program then needs to classify each individual digit. So, for instance, we'd like our program to recognize that the first digit above,

[image: the handwritten digit 5]

is a 5.
We'll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it's probably having trouble because the segmentation has been chosen incorrectly. This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.

To recognize individual digits we will use a three-layer neural network:

[diagram: a three-layer network with 784 input neurons, 15 hidden neurons, and 10 output neurons]
The input layer of the network contains neurons encoding the values of the input pixels. As discussed in the next section, our training data for the network will consist of many 28 by 28 pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons. For simplicity I've omitted most of the 784 input neurons in the diagram above. The input pixels are greyscale, with a value of 0.0 representing white, a value of 1.0 representing black, and in between values representing gradually darkening shades of grey.

The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by $n$, and we'll experiment with different values for $n$. The example shown illustrates a small hidden layer, containing just $n = 15$ neurons.

The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an output $\approx 1$, then that will indicate that the network thinks the digit is a 0. If the second neuron fires then that will indicate that the network thinks the digit is a 1. And so on. A little more precisely, we number the output neurons from 0 through 9, and figure out which neuron has the highest activation value. If that neuron is, say, neuron number 6, then our network will guess that the input digit was a 6. And so on for the other output neurons.
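Reading off the network's guess is then just a matter of finding the output neuron with the highest activation, which NumPy's argmax does directly (a small illustration of mine, with made-up activations):

import numpy as np

activations = np.array([0.02, 0.01, 0.1, 0.05, 0.0, 0.04, 0.9, 0.1, 0.03, 0.02])
print(np.argmax(activations))  # 6: the network guesses the digit is a 6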
You might wonder why we use 10 output neurons. After all, the goal of the network is to tell us which digit ($0, 1, 2, \ldots, 9$) corresponds to the input image. A seemingly natural way of doing that is to use just 4 output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to 0 or to 1. Four neurons are enough to encode the answer, since $2^4 = 16$ is more than the 10 possible values for the input digit. Why should our network use 10 neurons instead? Isn't that inefficient? The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with 10 output neurons learns to recognize digits better than the network with 4 output neurons. But that leaves us wondering why using 10 output neurons works better. Is there some heuristic that would tell us in advance that we should use the 10-output encoding instead of the 4-output encoding?
To understand why we do this, it helps to think about what the neural network is doing from first principles. Consider first the case where we use 10 output neurons. Let's concentrate on the first output neuron, the one that's trying to decide whether or not the digit is a 0. It does this by weighing up evidence from the hidden layer of neurons. What are those hidden neurons doing? Well, just suppose for the sake of argument that the first neuron in the hidden layer detects whether or not an image like the following is present:

[image: the top-left arc of a handwritten 0]

It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:

[image: the three remaining arcs of a handwritten 0]

As you may have guessed, these four images together make up the 0 image that we saw in the line of digits shown earlier:

[image: a complete handwritten 0]

So if all four of these hidden neurons are firing then we can conclude that the digit is a 0. Of course, that's not the only sort of evidence we can use to conclude that the image was a 0: we could legitimately get a 0 in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a 0.
Supposing the neural network functions in this way, we can give a plausible explanation for why it's better to have 10 outputs from the network, rather than 4. If we had 4 outputs, then the first output neuron would be trying to decide what the most significant bit of the digit was. And there's no easy way to relate that most significant bit to simple shapes like those shown above. It's hard to imagine that there's any good historical reason the component shapes of the digit will be closely related to (say) the most significant bit in the output.

Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes. Maybe a clever learning algorithm will find some assignment of weights that lets us use only 4 output neurons. But as a heuristic the way of thinking I've described works pretty well, and can save you a lot of time in designing good neural network architectures.
Exercise

There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer. Assume that the first 3 layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least 0.99, and incorrect outputs have activation less than 0.01.

[diagram: the three-layer network with an extra four-neuron output layer computing the binary representation]
Learning with gradient descent

Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we'll need is a data set to learn from: a so-called training data set. We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST's name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States' National Institute of Standards and Technology. Here's a few images from MNIST:

[image: a few sample MNIST digits]

As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we'll ask it to recognize images which aren't in the training set!

The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We'll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn't see during training.
We'll use the notation $x$ to denote a training input. It'll be convenient to regard each training input $x$ as a $28 \times 28 = 784$-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by $y = y(x)$, where $y$ is a 10-dimensional vector. For example, if a particular training image, $x$, depicts a 6, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network. Note that $T$ here is the transpose operation, turning a row vector into an ordinary (column) vector.
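Building such a vector is straightforward; here's one way it might look (a sketch of mine, similar in spirit to the helper the book's data loader uses):

import numpy as np

def vectorized_result(j):
    """Return a 10-dimensional one-hot column vector with 1.0 in slot j."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e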
What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates $y(x)$ for all training inputs $x$. To quantify how well we're achieving this goal we define a cost function*:

*Sometimes referred to as a loss or objective function. We use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks.

\[
C(w, b) \equiv \frac{1}{2n} \sum_x \| y(x) - a \|^2. \tag{6}
\]

Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$. Of course, the output $a$ depends on $x$, $w$ and $b$, but to keep the notation simple I haven't explicitly indicated this dependence. The notation $\| v \|$ just denotes the usual length function for a vector $v$. We'll call $C$ the quadratic cost function; it's also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that $C(w, b)$ is non-negative, since every term in the sum is non-negative. Furthermore, the cost $C(w, b)$ becomes small, i.e., $C(w, b) \approx 0$, precisely when $y(x)$ is approximately equal to the output, $a$, for all training inputs, $x$. So our training algorithm has done a good job if it can find weights and biases so that $C(w, b) \approx 0$. By contrast, it's not doing so well when $C(w, b)$ is large: that would mean that $y(x)$ is not close to the output $a$ for a large number of inputs. So the aim of our training algorithm will be to minimize the cost $C(w, b)$ as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.
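To make Equation (6) concrete, here's one way the quadratic cost might be computed in NumPy (a sketch of mine, assuming outputs and desired are lists of per-example vectors; it is not part of the book's program):

import numpy as np

def quadratic_cost(outputs, desired):
    """C(w, b) = (1/2n) sum_x ||y(x) - a||^2, per Equation (6)."""
    n = len(outputs)
    return sum(np.linalg.norm(y - a) ** 2
               for a, y in zip(outputs, desired)) / (2.0 * n)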
Why introduce the quadratic cost? After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.
Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6). Isn't this a rather ad hoc choice? Perhaps if we chose a different cost function we'd get a totally different set of minimizing weights and biases? This is a valid concern, and later we'll revisit the cost function, and make some modifications. However, the quadratic cost function of Equation (6) works perfectly well for understanding the basics of learning in neural networks, so we'll stick with it for now.
Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function $C(w, b)$. This is a well-posed problem, but it's got a lot of distracting structure as currently posed: the interpretation of $w$ and $b$ as weights and biases, the $\sigma$ function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we're going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we're going to imagine that we've simply been given a function of many variables and we want to minimize that function. We're going to develop a technique called gradient descent which can be used to solve such minimization problems. Then we'll come back to the specific function we want to minimize for neural networks.

Okay, let's suppose we're trying to minimize some function, $C(v)$. This could be any real-valued function of many variables, $v = v_1, v_2, \ldots$. Note that I've replaced the $w$ and $b$ notation by $v$ to emphasize that this could be any function: we're not specifically thinking in the neural networks context anymore. To minimize $C(v)$ it helps to imagine $C$ as a function of just two variables, which we'll call $v_1$ and $v_2$:

[plot: a valley-shaped surface, C as a function of v_1 and v_2]
What we'd like is to find where $C$ achieves its global minimum. Now, of course, for the function plotted above, we can eyeball the graph and find the minimum. In that sense, I've perhaps shown slightly too simple a function! A general function, $C$, may be a complicated function of many variables, and it won't usually be possible to just eyeball the graph to find the minimum.

One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where $C$ is an extremum. With some luck that might work when $C$ is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want far more variables: the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won't work!

(After asserting that we'll gain insight by imagining $C$ as a function of just two variables, I've turned around twice in two paragraphs and said, "hey, but what if it's a function of many more than two variables?" Sorry about that. Please believe me when I say that it really does help to imagine $C$ as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it's appropriate to use each picture, and when it's not.)
Okay, so calculus doesn't work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn't be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$: those derivatives would tell us everything we need to know about the local "shape" of the valley, and therefore how our ball should roll.

Based on what I've just written, you might suppose that we'll be trying to write down Newton's equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we're not going to take the ball-rolling analogy quite that seriously: we're devising an algorithm to minimize $C$, not developing an accurate simulation of the laws of physics! The ball's-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction. Calculus tells us that $C$ changes as follows:

\[
\Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \tag{7}
\]
We're going to find a way of choosing $\Delta v_1$ and $\Delta v_2$ so as to make $\Delta C$ negative; i.e., we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors. We'll also define the gradient of $C$ to be the vector of partial derivatives, $\left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T$. We denote the gradient vector by $\nabla C$, i.e.:

\[
\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \tag{8}
\]
With these definitions, the expression (7) for $\Delta C$ can be rewritten as

\[
\Delta C \approx \nabla C \cdot \Delta v. \tag{9}
\]

This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose $\Delta v$ so as to make $\Delta C$ negative. In particular, suppose we choose

\[
\Delta v = -\eta \nabla C, \tag{10}
\]

where $\eta$ is a small, positive parameter (known as the learning rate). Then Equation (9) tells us that $\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$. Because $\|\nabla C\|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in (10). (Within, of course, the limits of the approximation in Equation (9).) This is exactly the property we wanted! And so we'll take Equation (10) to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation (10) to compute a value for $\Delta v$, then move the ball's position $v$ by that amount:

\[
v \rightarrow v' = v - \eta \nabla C. \tag{11}
\]

Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing $C$ until, we hope, we reach a global minimum.
Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient $\nabla C$, and then to move in the opposite direction, "falling down" the slope of the valley. We can visualize it like this:

[plot: the valley surface with a ball descending step by step toward the minimum]

Notice that with this rule gradient descent doesn't reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It's only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing $\Delta v$ just says "go down, right now". That's still a pretty good rule for finding the minimum!
To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9) is a good approximation. If we don't, we might end up with $\Delta C > 0$, which obviously would not be good! At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, $\eta$ is often varied so that Equation (9) remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.
I've explained gradient descent when $C$ is a function of just two variables. But, in fact, everything works just as well even when $C$ is a function of many more variables. Suppose in particular that $C$ is a function of $m$ variables, $v_1, \ldots, v_m$. Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T$ is

\[
\Delta C \approx \nabla C \cdot \Delta v, \tag{12}
\]

where the gradient $\nabla C$ is the vector

\[
\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m} \right)^T. \tag{13}
\]

Just as for the two-variable case, we can choose

\[
\Delta v = -\eta \nabla C, \tag{14}
\]

and we're guaranteed that our (approximate) expression (12) for $\Delta C$ will be negative. This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule

\[
v \rightarrow v' = v - \eta \nabla C. \tag{15}
\]
You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position $v$ in order to find a minimum of the function $C$. The rule doesn't always work: several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters. But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.

Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move $\Delta v$ in position so as to decrease $C$ as much as possible. This is equivalent to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. We'll constrain the size of the move so that $\|\Delta v\| = \epsilon$ for some small fixed $\epsilon > 0$. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases $C$ as much as possible. It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = -\eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\| = \epsilon$. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease $C$.
Exercises

Prove the assertion of the last paragraph. Hint: If you're not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.

I explained gradient descent when $C$ is a function of two variables, and when it's a function of more than two variables. What happens when $C$ is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?
People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of $C$, and this can be quite costly. To see why it's costly, suppose we want to compute all the second partial derivatives $\partial^2 C / \partial v_j \partial v_k$. If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives*! That's going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we'll use gradient descent (and variations) as our main approach to learning in neural networks.

*Actually, more like half a trillion, since $\partial^2 C / \partial v_j \partial v_k = \partial^2 C / \partial v_k \partial v_j$. Still, you get the point.
How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (6). To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variables $v_j$. In other words, our "position" now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C / \partial b_l$. Writing out the gradient descent update rule in terms of components, we have

\begin{eqnarray}
w_k & \rightarrow & w_k' = w_k - \eta \frac{\partial C}{\partial w_k} \tag{16} \\
b_l & \rightarrow & b_l' = b_l - \eta \frac{\partial C}{\partial b_l}. \tag{17}
\end{eqnarray}

By repeatedly applying this update rule we can "roll down the hill", and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.
There are a number of challenges in applying the gradient descent rule. We'll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let's look back at the quadratic cost in Equation (6). Notice that this cost function has the form $C = \frac{1}{n} \sum_x C_x$, that is, it's an average over costs $C_x \equiv \frac{\|y(x) - a\|^2}{2}$ for individual training examples. In practice, to compute the gradient $\nabla C$ we need to compute the gradients $\nabla C_x$ separately for each training input, $x$, and then average them, $\nabla C = \frac{1}{n} \sum_x \nabla C_x$. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.
An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient $\nabla C$, and this helps speed up gradient descent, and thus learning.

To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number $m$ of randomly chosen training inputs. We'll label those random training inputs $X_1, X_2, \ldots, X_m$, and refer to them as a mini-batch. Provided the sample size $m$ is large enough we expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is,

\[
\frac{\sum_{j=1}^m \nabla C_{X_j}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \tag{18}
\]

where the second sum is over the entire set of training data. Swapping sides we get

\[
\nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_j}, \tag{19}
\]

confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.
To connect this explicitly to learning in neural networks, suppose $w_k$ and $b_l$ denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those,

\begin{eqnarray}
w_k & \rightarrow & w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20} \\
b_l & \rightarrow & b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21}
\end{eqnarray}

where the sums are over all the training examples $X_j$ in the current mini-batch. Then we pick out another randomly chosen mini-batch and train with those. And so on, until we've exhausted the training inputs, which is said to complete an epoch of training. At that point we start over with a new training epoch.
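In code, one epoch of this procedure might be organized like the following sketch (my own illustration; the book's actual implementation appears later in this chapter):

import random

def one_epoch(training_data, mini_batch_size, eta, update_mini_batch):
    """Shuffle the data, split it into mini-batches, and apply one
    gradient descent step per mini-batch."""
    random.shuffle(training_data)
    mini_batches = [training_data[k:k + mini_batch_size]
                    for k in range(0, len(training_data), mini_batch_size)]
    for mini_batch in mini_batches:
        update_mini_batch(mini_batch, eta)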
Incidentally, it's worth noting that conventions vary about scaling of the cost function and of mini-batch updates to the weights and biases. In Equation (6) we scaled the overall cost function by a factor $\frac{1}{n}$. People sometimes omit the $\frac{1}{n}$, summing over the costs of individual training examples instead of averaging. This is particularly useful when the total number of training examples isn't known in advance. This can occur if more training data is being generated in real time, for instance. And, in a similar way, the mini-batch update rules (20) and (21) sometimes omit the $\frac{1}{m}$ term out the front of the sums. Conceptually this makes little difference, since it's equivalent to rescaling the learning rate $\eta$. But when doing detailed comparisons of different work it's worth watching out for.
We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election. For example, if we have a training set of size $n = 60{,}000$, as in MNIST, and choose a mini-batch size of (say) $m = 10$, this means we'll get a factor of $6{,}000$ speedup in estimating the gradient! Of course, the estimate won't be perfect (there will be statistical fluctuations), but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease $C$, and that means we don't need an exact computation of the gradient. In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.
Exercise

An extreme version of gradient descent is to use a mini-batch size of just 1. That is, given a training input, $x$, we update our weights and biases according to the rules $w_k \rightarrow w_k' = w_k - \eta \partial C_x / \partial w_k$ and $b_l \rightarrow b_l' = b_l - \eta \partial C_x / \partial b_l$. Then we choose another training input, and update the weights and biases again. And so on, repeatedly. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, 20.
Let me conclude this section by discussing a point that sometimes bugs people new to gradient descent. In neural networks the cost $C$ is, of course, a function of many variables (all the weights and biases) and so in some sense defines a surface in a very high-dimensional space. Some people get hung up thinking: "Hey, I have to be able to visualize all these extra dimensions". And they may start to worry: "I can't think in four dimensions, let alone five (or five million)". Is there some special ability they're missing, some ability that "real" supermathematicians have? Of course, the answer is no. Even most professional mathematicians can't visualize four dimensions especially well, if at all. The trick they use, instead, is to develop other ways of representing what's going on. That's exactly what we did above: we used an algebraic (rather than visual) representation of $\Delta C$ to figure out how to move so as to decrease $C$. People who are good at thinking in high dimensions have a mental library containing many different techniques along these lines; our algebraic trick is just one example. Those techniques may not have the simplicity we're accustomed to when visualizing three dimensions, but once you build up a library of such techniques, you can get pretty good at thinking in high dimensions. I won't go into more detail here, but if you're interested then you may enjoy reading this discussion of some of the techniques professional mathematicians use to think in high dimensions. While some of the techniques discussed are quite complex, much of the best content is intuitive and accessible, and could be mastered by anyone.
Implementing our network to classify digits

Alright, let's write a program that learns how to recognize handwritten digits, using stochastic gradient descent and the MNIST training data. We'll do this with a short Python (2.7) program, just 74 lines of code! The first thing we need is to get the MNIST data. If you're a git user then you can obtain the data by cloning the code repository for this book,

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git then you can download the data and code here.

Incidentally, when I described the MNIST data earlier, I said it was split into 60,000 training images, and 10,000 test images. That's the official MNIST description. Actually, we're going to split the data a little differently. We'll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000-image validation set.
We won't use the validation data in this chapter, but later in the book we'll find it useful in figuring out how to set certain hyper-parameters of the neural network: things like the learning rate, and so on, which aren't directly selected by our learning algorithm. Although the validation data isn't part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. When I refer to the "MNIST training data" from now on, I'll be referring to our 50,000-image data set, not the original 60,000-image data set*.

*As noted earlier, the MNIST data set is based on two data sets collected by NIST, the United States' National Institute of Standards and Technology. To construct MNIST the NIST data sets were stripped down and put into a more convenient format by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. See this link for more details. The data set in my repository is in a form that makes it easy to load and manipulate the MNIST data in Python. I obtained this particular form of the data from the LISA machine learning laboratory at the University of Montreal (link).

Apart from the MNIST data we also need a Python library called Numpy, for doing fast linear algebra. If you don't already have Numpy installed, you can get it here.

Let me explain the core features of the neural networks code, before giving a full listing, below. The centerpiece is a Network class, which we use to represent a neural network. Here's the code we use to initialize a Network object:
class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

In this code, the list sizes contains the number of neurons in the respective layers. So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code:

net = Network([2, 3, 1])
The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean 0 and standard deviation 1. This random initialization gives our stochastic gradient descent algorithm a place to start from. In later chapters we'll find better ways of initializing the weights and biases, but this will do for now. Note that the Network initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.

Note also that the biases and weights are stored as lists of Numpy matrices. So, for example net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons. (It's not the first and second layers, since Python's list indexing starts at 0.) Since net.weights[1] is rather verbose, let's just denote that matrix w. It's a matrix such that w_jk is the weight for the connection between the k-th neuron in the second layer, and the j-th neuron in the third layer. This ordering of the j and k indices may seem strange - surely it'd make more sense to swap the j and k indices around? The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is:

a′ = σ(wa + b).   (22)
There's quite a bit going on in this equation, so let's unpack it piece by piece. a is the vector of activations of the second layer of neurons. To obtain a′ we multiply a by the weight matrix w, and add the vector b of biases. We then apply the function σ elementwise to every entry in the vector wa + b. (This is called vectorizing the function σ.) It's easy to verify that Equation (22) gives the same result as our earlier rule, Equation (4), for computing the output of a sigmoid neuron.
Exercise

Write out Equation (22) in component form, and verify that it gives the same result as the rule (4) for computing the output of a sigmoid neuron.
With all this in mind, it's easy to write code computing the output from a Network instance. We begin by defining the sigmoid function:
def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))
Note that when the input z is a vector or Numpy array, Numpy automatically applies the function sigmoid elementwise, that is, in vectorized form.
We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output*. All the method does is apply Equation (22) for each layer:

def feedforward(self, a):
    """Return the output of the network if "a" is input."""
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a)+b)
    return a

*It is assumed that the input a is an (n, 1) Numpy ndarray, not an (n,) vector. Here, n is the number of inputs to the network. If you try to use an (n,) vector as input you'll get strange results. Although using an (n,) vector appears the more natural choice, using an (n, 1) ndarray makes it particularly easy to modify the code to feedforward multiple inputs at once, and that is sometimes convenient.
Of course, the main thing we want our Network objects to do is to learn. To that end we'll give them an SGD method which implements stochastic gradient descent. Here's the code. It's a little mysterious in a few places, but I'll break it down below, after the listing.
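What follows is a minimal sketch of that method, consistent with the walkthrough in the next two paragraphs; the definitive listing is in the book's code repository:

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            # randomly shuffle the training data, then partition it
            # into mini-batches of the appropriate size
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            # apply a single step of gradient descent per mini-batch
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            # optionally track progress on the test data
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)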
The training_data is a list of tuples (x, y) representing the training inputs and corresponding desired outputs. The variables epochs and mini_batch_size are what you'd expect - the number of epochs to train for, and the size of the mini-batches to use when sampling. eta is the learning rate, η. If the optional argument test_data is supplied, then the program will evaluate the network after each epoch of training, and print out partial progress. This is useful for tracking progress, but slows things down substantially.
The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size. This is an easy way of sampling randomly from the training data. Then for each mini_batch we apply a single step of gradient descent. This is done by the code self.update_mini_batch(mini_batch, eta), which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch. Here's the code for the update_mini_batch method:
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
Most of the work is done by the line
delta_nabla_b, delta_nabla_w = self.backprop(x, y)
This invokes something called the backpropagation algorithm, which is a fast way of computing the gradient of the cost function. So update_mini_batch works simply by computing these gradients for every training example in the mini_batch, and then updating self.weights and self.biases appropriately.
I'm not going to show the code for self.backprop right now. We'll study how backpropagation works in the next chapter, including the code for self.backprop. For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated to the training example x.
Let's look at the full program, including the documentation strings, which I omitted above. Apart from self.backprop the program is self-explanatory - all the heavy lifting is done in self.SGD and self.update_mini_batch, which we've already discussed. The self.backprop method makes use of a few extra functions to help in computing the gradient, namely sigmoid_prime, which computes the derivative of the σ function, and self.cost_derivative, which I won't describe here. You can get the gist of these (and perhaps the details) just by looking at the code and documentation strings. We'll look at them in detail in the next chapter. Note that while the program appears lengthy, much of the code is documentation strings intended to make the code easy to understand. In fact, the program contains just 74 lines of non-whitespace, non-comment code. All the code may be found on GitHub here.
"""
network.py
~~~~~~~~~~
#### Libraries
# Standard library
import random
# Third-party libraries
import numpy as np
class Network(object):
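    # The class body consists of the methods shown earlier (__init__,
    # feedforward, SGD, and update_mini_batch), together with backprop,
    # which is covered in the next chapter, and two small helpers.
    # The helpers are sketched here; the definitive listing is on
    # GitHub.

    def evaluate(self, test_data):
        """Return the number of test inputs for which the neural
        network outputs the correct result.  The network's output is
        taken to be the index of whichever output neuron has the
        highest activation."""
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives of the quadratic
        cost with respect to the output activations."""
        return (output_activations-y)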
def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
How well does the program recognize handwritten digits? Well, let's start by loading in the MNIST data. I'll do this using a little helper program, mnist_loader.py, to be described below. We execute the following commands in a Python shell,
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
Of course, this could also be done in a separate Python program, but if you're following along it's probably easiest to do in a Python shell.

After loading the MNIST data, we'll set up a Network with 30 hidden neurons. We do this after importing the Python program listed above, which is named network,
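>>> import network
>>> net = network.Network([784, 30, 10])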
Finally, we'll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of η = 3.0,
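>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)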
Note that if you're running the code as you read along, it will take some time to execute - for a typical machine (as of 2015) it will likely take a few minutes to run. I suggest you set things running, continue to read, and periodically check the output from the code. If you're in a rush you can speed things up by decreasing the number of epochs, by decreasing the number of hidden neurons, or by using only part of the training data. Note that production code would be much, much faster: these Python scripts are intended to help you understand how neural nets work, not to be high-performance code! And, of course, once we've trained a network it can be run very quickly indeed, on almost any computing platform. For example, once we've learned a good set of weights and biases for a network, it can easily be ported to run in Javascript in a web browser, or as a native app on a mobile device. In any case, here is a partial transcript of the output of one training run of the neural network. The transcript shows the number of test images correctly recognized by the neural network after each epoch of training. As you can see, after just a single epoch this has reached 9,129 out of 10,000, and the number continues to grow,
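Epoch 0: 9129 / 10000
...
Epoch 28: 9542 / 10000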
That is, the trained network gives us a classification rate of about 95 percent - 95.42 percent at its peak ("Epoch 28")! That's quite encouraging as a first attempt. I should warn you, however, that if you run the code then your results are not necessarily going to be quite the same as mine, since we'll be initializing our network using (different) random weights and biases. To generate results in this chapter I've taken best-of-three runs.
Let's rerun the above experiment, changing the number of hidden neurons to 100. As was the case earlier, if you're running the code as you read along, you should be warned that it takes quite a while to execute (on my machine this experiment takes tens of seconds for each training epoch), so it's wise to continue reading in parallel while the code executes.
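In the shell, that looks like this (assuming the other hyper-parameters are kept as before):

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)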
Sure enough, this improves the results to 96.59 percent. At least in this case, using more hidden neurons helps us get better results*.

*Reader feedback indicates quite some variation in results for this experiment, and some training runs give results quite a bit worse. Using the techniques introduced in chapter 3 will greatly reduce the variation in performance across different training runs for our networks.

Of course, to obtain these accuracies I had to make specific choices for the number of epochs of training, the mini-batch size, and the learning rate, η.
As I mentioned above, these are known as hyper-parameters for our neural network, in order to distinguish them from the parameters (weights and biases) learnt by our learning algorithm. If we choose our hyper-parameters poorly, we can get bad results. Suppose, for example, that we'd chosen the learning rate to be η = 0.001,
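>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 0.001, test_data=test_data)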
The results are much less encouraging,

However, you can see that the performance of the network is getting slowly better over time. That suggests increasing the learning rate, say to η = 0.01. If we do that, we get better results, which suggests increasing the learning rate again. (If making a change improves things, try doing more!) If we do that several times over, we'll end up with a learning rate of something like η = 1.0 (and perhaps fine tune to 3.0), which is close to our earlier experiments. So even though we initially made a poor choice of hyper-parameters, we at least got enough information to help us improve our choice of hyper-parameters.
In general, debugging a neural network can be challenging. This is especially true when the initial choice of hyper-parameters produces results no better than random noise. Suppose we try the successful 30 hidden neuron network architecture from earlier, but with the learning rate changed to η = 100.0:
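>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 100.0, test_data=test_data)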
At this point we've actually gone too far, and the learning rate is too high:
Epoch 0: 1009 / 10000
Epoch 1: 1009 / 10000
Epoch 2: 1009 / 10000
Epoch 3: 1009 / 10000
...
Epoch 27: 982 / 10000
Epoch 28: 982 / 10000
Epoch 29: 982 / 10000
Now imagine that we were coming to this problem for the first time. Of course, we know from our earlier experiments that the right thing to do is to decrease the learning rate. But if we were coming to this problem for the first time then there wouldn't be much in the output to guide us on what to do. We might worry not only about the learning rate, but about every other aspect of our neural network. We might wonder if we've initialized the weights and biases in a way that makes it hard for the network to learn? Or maybe we don't have enough training data to get meaningful learning? Perhaps we haven't run for enough epochs? Or maybe it's impossible for a neural network with this architecture to learn to recognize handwritten digits? Maybe the learning rate is too low? Or, maybe, the learning rate is too high? When you're coming to a problem for the first time, you're not always sure.
The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture. We'll discuss all these at length through the book, including how I chose the hyper-parameters above.
Exercise

Try creating a network with just two layers - an input and an output layer, no hidden layer - with 784 and 10 neurons, respectively. Train the network using stochastic gradient descent. What classification accuracy can you achieve?
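If you'd like a starting point, the network can be created and trained along the same lines as before (the hyper-parameters here are just one plausible choice):

>>> net = network.Network([784, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)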
Earlier, I skipped over the details of how the MNIST data is loaded. It's pretty straightforward. For completeness, here's the code. The data structures used to store the MNIST data are described in the documentation strings - it's straightforward stuff, tuples and lists of Numpy ndarray objects (think of them as vectors if you're not familiar with ndarrays):
"""
mnist_loader
~~~~~~~~~~~~
A library to load the MNIST image data. For details of the data
structures that are returned, see the doc strings for ``load_data``
and ``load_data_wrapper``. In practice, ``load_data_wrapper`` is the
function usually called by our neural network code.
"""
#### Libraries
# Standard library
import cPickle
import gzip
# Third-party libraries
import numpy as np
def load_data():
    """Return the MNIST data as a tuple containing the training data,
    the validation data, and the test data.

    In outline, ``training_data`` is a tuple with two entries: a numpy
    ndarray containing 50,000 training images, each itself an ndarray
    of 784 values (the 28 x 28 = 784 pixels), and an ndarray of the
    50,000 corresponding digit values (0...9).  ``validation_data``
    and ``test_data`` are similar, except each contains only 10,000
    images.

    This is a nice data format, but for use in neural networks it's
    helpful to modify the format of the ``training_data`` a little.
    That's done in the wrapper function ``load_data_wrapper()``, see
    below.
    """
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()
    return (training_data, validation_data, test_data)
def load_data_wrapper():
    """Return a tuple containing ``(training_data, validation_data,
    test_data)``. Based on ``load_data``, but the format is more
    convenient for use in our implementation of neural networks.

    In particular, ``training_data`` is a list containing 50,000
    2-tuples ``(x, y)``.  ``x`` is a 784-dimensional numpy.ndarray
    containing the input image.  ``y`` is a 10-dimensional
    numpy.ndarray representing the unit vector corresponding to the
    correct digit for ``x``.
    """
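    # A sketch of the wrapper's body, consistent with the formats
    # described in the docstring (the definitive version is in the
    # book's repository):
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = zip(training_inputs, training_results)
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = zip(validation_inputs, va_d[1])
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = zip(test_inputs, te_d[1])
    return (training_data, validation_data, test_data)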
def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the jth
    position and zeroes elsewhere.  This is used to convert a digit
    (0...9) into a corresponding desired output from the neural
    network."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e
I said above that our program gets pretty good results. What does that mean? Good compared to what? It's informative to have some simple (non-neural-network) baseline tests to compare against, to understand what it means to perform well. The simplest baseline of all, of course, is to randomly guess the digit. That'll be right about ten percent of the time. We're doing much better than that!
What about a less trivial baseline? Let's try an extremely simple idea: we'll look at how dark an image is. For instance, an image of a 2 will typically be quite a bit darker than an image of a 1, just because more pixels are blackened out, as the following examples illustrate:

This suggests using the training data to compute average darknesses for each digit, 0, 1, 2, ..., 9. When presented with a new image, we compute how dark the image is, and then guess that it's whichever digit has the closest average darkness. This is a simple procedure, and is easy to code up, so I won't explicitly write out the code - if you're interested it's in the GitHub repository. But it's a big improvement over random guessing, getting 2,225 of the 10,000 test images correct, i.e., 22.25 percent accuracy.
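To give the flavour, here's a minimal sketch of the idea - this isn't the repository's code, and it assumes the data tuples returned by load_data above, with "darkness" taken to be the summed pixel intensity of an image:

from collections import defaultdict

def avg_darknesses(training_data):
    """Return a dict mapping each digit to the average darkness
    (summed pixel intensity) of the training images of that digit."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for image, digit in zip(training_data[0], training_data[1]):
        totals[digit] += sum(image)
        counts[digit] += 1
    return dict((d, totals[d] / counts[d]) for d in counts)

def guess_digit(image, darknesses):
    """Guess whichever digit has average darkness closest to the
    darkness of ``image``."""
    darkness = sum(image)
    return min(darknesses, key=lambda d: abs(darknesses[d] - darkness))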
It's not difficult to find other ideas which achieve accuracies in the 20 to 50 percent range. If you work a bit harder you can get up over 50 percent. But to get much higher accuracies it helps to use established machine learning algorithms. Let's try using one of the best known algorithms, the support vector machine or SVM. If you're not familiar with SVMs, not to worry, we're not going to need to understand the details of how SVMs work. Instead, we'll use a Python library called scikit-learn, which provides a simple Python interface to a fast C-based library for SVMs known as LIBSVM.
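In outline, the classifier can be run along these lines - a sketch under the assumption that the data comes from load_data above, rather than the repository's exact script:

from sklearn import svm

def svm_baseline():
    training_data, validation_data, test_data = load_data()
    # train the classifier with scikit-learn's default SVM settings
    clf = svm.SVC()
    clf.fit(training_data[0], training_data[1])
    # count how many test images are classified correctly
    predictions = [int(a) for a in clf.predict(test_data[0])]
    num_correct = sum(int(a == y) for a, y in zip(predictions, test_data[1]))
    print "%s of %s test values correct." % (num_correct, len(test_data[1]))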
If we run scikit-learn's SVM classifier using the default settings, then it gets 9,435 of 10,000 test images correct. (The code is available here.) That's a big improvement over our naive approach of classifying an image based on how dark it is. Indeed, it means that the SVM is performing roughly as well as our neural networks, just a little worse. In later chapters we'll introduce new techniques that enable us to improve our neural networks so that they perform much better than the SVM.
That's not the end of the story, however. The 9,435 of 10,000 result is for scikit-learn's default settings for SVMs. SVMs have a number of tunable parameters, and it's possible to search for parameters which improve this out-of-the-box performance. I won't explicitly do this search, but instead refer you to this blog post by Andreas Mueller if you'd like to know more. Mueller shows that with some work optimizing the SVM's parameters it's possible to get the performance up above 98.5 percent accuracy. In other words, a well-tuned SVM only makes an error on about one digit in 70. That's pretty good! Can neural networks do better?
In fact, they can. At present, well-designed neural networks outperform every other technique for solving MNIST, including SVMs. The current (2013) record is classifying 9,979 of 10,000 images correctly. This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. We'll see most of the techniques they used later in the book. At that level the performance is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence, for example:
I trust you'll agree that those are tough to classify! With images like these in the MNIST data set it's remarkable that neural networks can accurately classify all but 21 of the 10,000 test images. Usually, when programming we believe that solving a complicated problem like recognizing the MNIST digits requires a sophisticated algorithm. But even the neural networks in the Wan et al paper just mentioned involve quite simple algorithms, variations on the algorithm we've seen in this chapter. All the complexity is learned, automatically, from the training data. In some sense, the moral of both our results and those in more sophisticated papers, is that for some problems:

sophisticated algorithm ≤ simple learning algorithm + good training data.
Toward deep learning
While our neural network gives impressive performance, that performance is somewhat mysterious. The weights and biases in the network were discovered automatically. And that means we don't immediately have an explanation of how the network does what it does. Can we find some way to understand the principles by which our network is classifying handwritten digits? And, given such principles, can we do better?
To put these questions more starkly, suppose that a few decades hence neural networks lead to artificial intelligence (AI). Will we understand how such intelligent networks work? Perhaps the networks will be opaque to us, with weights and biases we don't understand, because they've been learned automatically. In the early days of AI research people hoped that the effort to build an AI would also help us understand the principles behind intelligence and, maybe, the functioning of the human brain. But perhaps the outcome will be that we end up understanding neither the brain nor how artificial intelligence works!
To address these questions, let's think back to the interpretation of artificial neurons that I gave at the start of the chapter, as a means of weighing evidence. Suppose we want to determine whether an image shows a human face or not:
[Images. Credits: 1. Ester Inbar. 2. Unknown. 3. NASA, ESA, G. Illingworth, D. Magee, and P. Oesch (University of California, Santa Cruz), R. Bouwens (Leiden University), and the HUDF09 Team.]
We could attack this problem the same way we attacked handwriting recognition - by using the pixels in the image as input to a neural network, with the output from the network a single neuron indicating either "Yes, it's a face" or "No, it's not a face".
Let's suppose we do this, but that we're not using a learning algorithm. Instead, we're going to try to design a network by hand, choosing appropriate weights and biases. How might we go about it? Forgetting neural networks entirely for the moment, a heuristic we could use is to decompose the problem into sub-problems: does the image have an eye in the top left? Does it have an eye in the top right? Does it have a nose in the middle? Does it have a mouth in the bottom middle? Is there hair on top? And so on.
If the answers to several of these questions are "yes", or even just "probably yes", then we'd conclude that the image is likely to be a face. Conversely, if the answers to most of the questions are "no", then the image probably isn't a face.
Of course, this is just a rough heuristic, and it suffers from many deficiencies. Maybe the person is bald, so they have no hair. Maybe we can only see part of the face, or the face is at an angle, so some of the facial features are obscured. Still, the heuristic suggests that if we can solve the sub-problems using neural networks, then perhaps we can build a neural network for face-detection, by combining the networks for the sub-problems. Here's a possible architecture, with rectangles denoting the sub-networks. Note that this isn't intended as a realistic approach to solving the face-detection problem; rather, it's to help us build intuition about how networks function. Here's the architecture:
It's also plausible that the sub-networks can be decomposed. Suppose we're considering the question: "Is there an eye in the top left?" This can be decomposed into questions such as: "Is there an eyebrow?"; "Are there eyelashes?"; "Is there an iris?"; and so on. Of course, these questions should really include positional information, as well - "Is the eyebrow in the top left, and above the iris?", that kind of thing - but let's keep it simple. The network to answer the question "Is there an eye in the top left?" can now be decomposed:
Those questions too can be broken down, further and further through multiple layers. Ultimately, we'll be working with sub-networks that answer questions so simple they can easily be answered at the level of single pixels. Those questions might, for example, be about the presence or absence of very simple shapes at particular points in the image. Such questions can be answered by single neurons connected to the raw pixels in the image.
The end result is a network which breaks down a very complicated question - does this image show a face or not - into very simple questions answerable at the level of single pixels. It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts. Networks with this kind of many-layer structure - two or more hidden layers - are called deep neural networks.
Of course, I haven't said how to do this recursive decomposition into sub-networks. It certainly isn't practical to hand-design the weights and biases in the network. Instead, we'd like to use learning algorithms so that the network can automatically learn the weights and biases - and thus, the hierarchy of concepts - from training data. Researchers in the 1980s and 1990s tried using stochastic gradient descent and backpropagation to train deep networks. Unfortunately, except for a few special architectures, they didn't have much luck. The networks would learn, but very slowly, and in practice often too slowly to be useful.
Since 2006, a set of techniques has been developed that enable learning in deep neural nets. These deep learning techniques are based on stochastic gradient descent and backpropagation, but also introduce new ideas. These techniques have enabled much deeper (and larger) networks to be trained - people now routinely train networks with 5 to 10 hidden layers. And, it turns out that these perform far better on many problems than shallow neural networks, i.e., networks with just a single hidden layer. The reason, of course, is the ability of deep nets to build up a complex hierarchy of concepts. It's a bit like the way conventional programming languages use modular design and ideas about abstraction to enable the creation of complex computer programs. Comparing a deep network to a shallow network is a bit like comparing a programming language with the ability to make function calls to a stripped-down language with no ability to make such calls. Abstraction takes a different form in neural networks than it does in conventional programming, but it's just as important.
In academic work, please cite this book as: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015

Last update: Sun Jan 1 16:00:21 2017

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License. This means you're free to copy, share, and build on this book, but not to sell it. If you're interested in commercial use, please contact me.