Abstract
Scripts are standardized event sequences describing typical everyday activities, which play an important role in the computational modeling of cognitive abilities (in particular for natural language processing). We present a large-scale crowdsourced collection of explicit linguistic descriptions of script-specific event sequences (40 scenarios with 100 sequences each). The corpus is enriched with crowdsourced alignment annotation on a subset of the event descriptions, to be used in future work as seed data for automatic alignment of event descriptions (for example via clustering). The event descriptions to be aligned were chosen among those expected to have the strongest corrective effect on the clustering algorithm. The alignment annotation was evaluated against a gold standard of expert annotators. The resulting database of partially-aligned script-event descriptions provides a sound empirical basis for inducing high-quality script knowledge, as well as for any task involving alignment and paraphrase detection of events.
Keywords: scripts, events, crowdsourcing, paraphrase
Figure 2: Example of an induced script structure for the BAKING A CAKE scenario (events such as Choose Recipe, Buy Ingredients and Prepare Ingredients, each paired with event descriptions like look up R, find R, get your R; purchase ING, buy ING, buy proper ING; stir to combine, mix well, mix ING together in B, stir ING).
choose_recipe
Review desired recipe, Look up a cake recipe, Print out or write down the recipe, Read recipe, ...

buy_ingredients
Buy other ingredients if you do not have at home, Buy cake ingredients, Purchase ingredients, ...

get_ingredients
Gather all ingredients, Set out ingredients, Gather ingredients, gather together cake ingredients such as eggs, butter, ...

add_ingredients
Add water, sugar, beaten egg and salt one by one, Whisk after each addition, Add the dry mixture to the wet mixture, Mix the dry ingredients in one bowl (flour, baking soda, salt, etc), Add ingredients in mixing bowl, get mixing bowl, ...

prepare_ingredients
Mix them together, Open the ingredients, Stir ingredients, Combine and mix all the ingredients as the recipe delegates, Mix ingredients with mixer, ...

put_cake_oven
Put the mix in the pans, Put the cake batter in the oven, Put cake in oven, Put greased pan into preheated stove, Store any leftovers in the fridge, Cover it and put it on a oven plate, Put the prepared oven plate inside oven ...

Figure 3: Example clusters (event labels are underlined, outliers are in italics).

Script events are temporally ordered by default, but their order can vary. For example, when baking a cake, one can preheat the oven before or after mixing the ingredients. MSA does not allow for crossing alignments, and thus is not able to model order variation: this leads to an inappropriately fine-grained inventory of event types.

Clustering algorithms using both semantic similarity and positional similarity information provide an alternative way to induce scripts. They can be made sensitive to ordering information, but do not use it as a hard constraint, and therefore allow for an appropriate treatment of order variation. We have experimented with several clustering algorithms. Figure 3 shows an example output of the best-performing algorithm, Affinity Propagation (AP, Frey and Dueck (2007)). The outcome is quite noisy. For instance, in the last cluster, which mainly consists of descriptions of the “putting cake in oven” event, put the mix in the pans and store any leftovers in the fridge clearly are outliers. The noise is to some degree due to the complex nature of script structures in general, but it is also the price one has to pay for a gain in flexibility of event ordering.

Clustering algorithms rely on good estimates of similarity among the data points. To appropriately group event descriptions into paraphrase sets, the clustering algorithm would need information on script-specific semantic similarity that goes beyond pure semantic similarity. For instance, in the FLYING IN AN AIRPLANE scenario, it is not trivial for any semantic similarity measure to predict that walk up the ramp and board plane are functionally similar with respect to the given scenario. To address this issue, we collect partial alignment information that we will use as seed data in future work on semi-supervised clustering. The alignment annotations are also suitable for a semi-supervised extension of the event-ordering model of Frermann et al. (2014).

In this work, we have taken measures to provide a sound empirical basis for better-quality script models, by extending existing corpora in two different ways. First, we crowdsourced a corpus of 40 scenarios with 100 ESDs each, thus going beyond the size of previous script collections. Second, we enriched the corpus with partial alignments of ESDs, done by human annotators. The result is a corpus of partially-aligned generic activity descriptions, the DeScript corpus (Describing Script Structure). More generally, DeScript is a valuable resource for any task involving alignment and paraphrase detection of events. The corpus is publicly available for scientific research purposes at this url: http://www.sfb1102.uni-saarland.de/?page_id=2582.

2. Data Collection

2.1. ESD Collection
Scenario choices were based on previous work by Raisig et al. (2009), Regneri et al. (2010) and Singh et al. (2002). We included scenarios that require simple general knowledge (e.g. WASHING ONE'S HAIR), as well as more complex ones (e.g. BORROWING A BOOK FROM THE LIBRARY), scenarios showing a considerable degree of variability (e.g. GOING TO A FUNERAL) and scenarios requiring some amount of expert knowledge (e.g. RENOVATING A ROOM).
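As a rough illustration of the Affinity Propagation clustering discussed above, the sketch below clusters a handful of event descriptions with scikit-learn's AffinityPropagation. The example descriptions and the Jaccard token-overlap similarity are invented stand-ins; the actual experiments combine semantic and positional similarity, which this toy code does not reproduce.

```python
# Toy sketch of clustering event descriptions with Affinity Propagation.
# The Jaccard token overlap is a crude stand-in for the script-specific
# similarity measure discussed in the text.
import numpy as np
from sklearn.cluster import AffinityPropagation

events = [
    "look up a recipe", "find a recipe", "read recipe",
    "buy ingredients", "purchase ingredients",
    "put cake in oven", "put the cake batter in the oven",
]

def jaccard(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

# With affinity="precomputed", AP consumes a similarity matrix directly;
# the preference parameter (by default the median similarity) controls
# how many exemplars, and hence clusters, emerge.
S = np.array([[jaccard(a, b) for b in events] for a in events])
ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)

for label in sorted(set(ap.labels_)):
    print(label, [e for e, l in zip(events, ap.labels_) if l == label])
```

Lowering the preference makes creating clusters more costly and merges paraphrases into fewer, coarser groups, which is exactly the knob exploited later for seed selection.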
Figure 4 (panels a and b): the alignment interface, showing a source ESD (Purchase cake mix, Preheat oven, Grease pan, Carefully follow instructions step by step, Open mix and add required ingredients, Mix well, Pour into prepared pan, Set timer on oven, Bake cake for required time, Remove cake from oven and cool, Turn cake out onto cake plate, Apply icing or glaze) next to a target ESD (Take out box of cake mix from shelf, Gather together cake ingredients, Get mixing bowl, or spoon or fork, Add ingredients to bowl, Stir together and mix, Use fork to breakup clumps, Preheat oven, Spray pan with non stick or grease, Pour cake mix into pan, Put pan into oven, Bake cake, Remove cake pan when timer goes off, Stick tooth pick into cake to see if done, Let cake pan cool then remove cake), with ED-to-ED links, ED-to-in-between links, and the “No Linking Possible” option.
ESD (ED-to-ED link, e.g. pour into prepared pan → pour cake mix into pan in Figure 4a). If the target ESD did not contain any matching event description, the workers could either select a position between two event descriptions on the target side where the source event would usually take place (ED-to-in-between links, e.g., set timer on oven could take place between put pan into oven and bake cake), or they could indicate that no linking at all is possible (no-links). This latter option is useful in case of spurious events (e.g., carefully follow instructions step by step is not really an event) but also for alternative realizations of a script (e.g. paying cash vs. with a credit card).

Multiple target
If the workers felt that the source event was described in more detail in the target ESD compared to the source ESD, they could link the source event description to more than one event description in the target ESD (e.g. mix well → stir together and mix, use fork to break up clumps, which can be broken down to two or more ED-to-ED links, see Figure 4b). Also when linking the source description to multiple descriptions in the target ESD, workers could choose in-between positions (ED-to-in-between links).

Seed data selection
In order to minimize the amount of seed data needed for semi-supervised clustering, we employed several criteria for choosing the most informative event descriptions to be aligned. Informative seeds are expected to have the strongest corrective effect on the clustering algorithm. Thus, the event descriptions to be aligned should be the borderline cases that are the most difficult for the algorithm. In order to select the borderline cases, henceforth called outliers, we used two methods. First, we ran the Affinity Propagation clustering algorithm varying the preference parameter that determines the cost of creating clusters, thus leading to different configurations of clusters (i.e. to a varying number of clusters and cluster sizes), and we chose those event descriptions that changed their neighbors and cluster centers as the number of clusters increased or decreased. Secondly, we chose those event descriptions that were not well clustered as measured by the Silhouette index, which takes into account the average dissimilarity of an item to the members of its cluster and its lowest average dissimilarity to the members of the other clusters. The two criteria thus select the most difficult cases for the algorithm, that is, cases whose true alignment is expected to be most informative.

We also included event descriptions that were not outliers, henceforth called stable cases, which were used as a baseline when evaluating the inter-annotator agreement to show the difficulty of the outliers. The stable cases were randomly selected from those event descriptions that were not outliers. We expect more annotator agreement in stable cases as compared to the outliers. Approximately 20% of the event descriptions per scenario were selected as outliers and 10% were selected as stable cases and presented to the workers to align with another event.²

² Note that we refer to the number of descriptions per scenario. The annotated alignments are less than 1% of all possible alignments (ca. 0.1-0.2%).

Additionally, we included gold seeds, that is, a small subset of the event descriptions aligned by experts as part of our gold standard annotation (see Section 2.3.), to be used for an evaluation of the workers' annotation. 5% of the event descriptions per scenario were included as gold seeds. The workers were not aware of which alignments were outliers, stable cases or gold seeds.

Each source ESD containing outlier, stable or gold event descriptions was matched with 3 target ESDs. Each source-target ESD pair was annotated by 3 different annotators. In total, approximately 600 outlier event descriptions, 300 stable event descriptions and 120 gold seeds were selected for each scenario and presented to workers for annotation. 292 workers (native speakers of English) took part in the annotation study. They were paid 0.35 USD per ESD and took on average 1.05 minutes per ESD.

2.3. Gold Standard Alignment Annotation
The DeScript corpus also includes a gold standard corpus of aligned event descriptions annotated by experts, to be used in the evaluation of semi-supervised clustering of event descriptions. A small subset of the gold standard alignments was used to evaluate the quality of the crowdsourced alignments. The gold standard covers 10 scenarios, with 50 ESDs each.

Four experts, all computational linguistics students, were trained for the task and were presented with a source and a target ESD in an interface where (unlike the M-Turk interface) no event description in the source ESD was highlighted. They were to link each event description in the source ESD with all event descriptions in the target ESD that were semantically similar to it with respect to the given scenario. Every ESD in a given scenario was paired with every other ESD in the same scenario. In the end, all similar event descriptions were aligned and the full alignments were used to group the event descriptions into gold paraphrase sets.

Scenario                             EDs annotated   EDs excluded   Gold sets
baking a cake                              513             29            20
borrowing a book from the library          389             46            16
flying in an airplane                      504             46            24
getting a haircut                          418             52            23
going grocery shopping                     505             95            18
going on a train                           357             45            12
planting a tree                            344             41            13
repairing a flat bicycle tire              363             40            16
riding on a bus                            358             23            14
taking a bath                              479             29            20
Total/Average                             4230      446 (10.5%)          18

Table 1: Gold alignment annotation: the annotated EDs for each scenario, the excluded EDs for each scenario (singletons or unrelated events) and the number of gold paraphrase sets obtained.
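The two outlier-selection heuristics described in the seed data selection above can be sketched roughly as follows: (1) re-cluster with different AP preference values and flag items whose exemplar changes, and (2) flag the items with the lowest Silhouette scores. The random vectors, the similarity matrix, and the cut-off of five items are invented stand-ins, not the paper's actual representations or thresholds.

```python
# Sketch of the two outlier-selection criteria: exemplar instability
# under a preference sweep, and low Silhouette scores. Random vectors
# stand in for event-description representations.
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                          # 30 toy EDs
S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # neg. squared distance

def exemplars(preference):
    """Cluster with AP; return each item's exemplar index (None if no fit)."""
    ap = AffinityPropagation(affinity="precomputed", preference=preference,
                             random_state=0, max_iter=500).fit(S)
    centers = ap.cluster_centers_indices_
    return centers[ap.labels_] if len(centers) else None

# Criterion 1: items whose exemplar changes as the preference varies
# (a lower preference makes creating clusters more costly -> fewer clusters).
runs = [r for p in (np.median(S), 2 * np.median(S))
        if (r := exemplars(p)) is not None]
unstable = ({i for i, pair in enumerate(zip(*runs)) if len(set(pair)) > 1}
            if len(runs) > 1 else set())

# Criterion 2: the worst-clustered items by Silhouette score.
labels = AffinityPropagation(affinity="precomputed", random_state=0).fit(S).labels_
worst = (set(np.argsort(silhouette_samples(X, labels))[:5].tolist())
         if len(set(labels)) > 1 else set())

outliers = unstable | worst
print(sorted(outliers))
```

The union of both sets plays the role of the "borderline cases" presented to workers; in the paper these were capped at roughly 20% of the event descriptions per scenario.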
Scenario ED-to-ED links ED-to-in-between links No-links Total links
baking a cake 1836 831 448 3115
borrowing a book from the library 1787 1146 254 3187
flying in an airplane 1676 1219 192 3087
getting a haircut 1887 926 292 3105
going grocery shopping 1863 997 251 3111
going on a train 1937 884 285 3106
planting a tree 1902 845 343 3090
repairing a flat bicycle tire 1500 1019 612 3131
riding on a bus 2226 750 114 3090
taking a bath 1898 891 314 3103
Total 18512 9508 3105 31125
Table 2: All links drawn for all source-target ESD pairs by all annotators. Note: the first column includes both ED-to-ED
links from single-target cases and each single ED-to-ED link (arrow) in multiple-target cases.
In creating the gold paraphrase sets, we assumed script-specific functional equivalence, that is, we instructed the annotators to group event descriptions serving the same function in the script as semantically similar (e.g., scan bus pass, pay for fare and show driver ticket are grouped into the same paraphrase set in the RIDING ON A BUS scenario, as they all represent the “pay” event). Each paraphrase set was annotated with an event label that indicated the event being expressed in the given paraphrase set (e.g. in Figure 2, choose_recipe and buy_ingredients are example event labels for BAKING A CAKE). Event labels were harmonized with those used in the annotation of stories in Modi et al. (2016) (see Section 4.).

Paraphrase sets that were singletons (e.g. in BAKING A CAKE, return to oven occurred only once) were considered not representative of the events in the scenario, and hence were removed based on the gold annotation. Likewise, event descriptions that were not related to the script (e.g., in RIDING ON A BUS, pray your bus is on time) or that were related to the scenario but not really part of the script (e.g., in BAKING A CAKE, store any leftovers in the fridge) were also removed. In the end, approximately 10.5% of the annotated event descriptions were excluded. Table 1 indicates the number of excluded event descriptions and the number of gold clusters per scenario.

The result of the gold standard annotation is a rich resource of full alignments for all event descriptions in 500 ESDs (10 scenarios with 50 ESDs each), grouped in gold paraphrase sets, each containing different linguistic variations of the events in the scenario.

3. Data Analysis and Evaluation
Annotators preferred single-target links (chosen 30021 times) over multiple-target links (which were chosen 238 times).

We distinguish between one-to-one alignments, that is, cases which all three annotators considered to be single-target cases, since they used exactly one link between source and target ESD (including in-between links and no-links), and one-to-many alignments, that is, cases which at least one annotator considered to be multiple-target cases, using more than one link between source and target ESD. Note that, as annotators preferred single-target links over multiple-target links, most source event descriptions are annotated as one-to-one alignments (Table 3).

Many workers were involved in the alignment annotation task and not all of them aligned the complete set of highlighted event descriptions. For this reason, we computed agreement as the number of times where the majority of workers agreed on an alignment instance, instead of Fleiss' Kappa (Fleiss, 1971).

One-to-one alignments
Figure 5 shows how well the workers agreed with each other in the annotation of the outliers and stable cases for one-to-one alignments. As expected, on average, workers tended to agree more in stable cases as compared to outliers. The relatively low agreement figures for BORROWING A BOOK FROM THE LIBRARY can be explained by the variety and complexity of the scenario: e.g. given a source event description find the book on the shelf, and select a book and take the book off the shelf in the target ESD, it is not obvious whether the source event is most similar to find the book on the shelf, to take the book off the shelf, or to the position in between the two.
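The majority-based agreement measure used here (instead of Fleiss' Kappa) can be sketched as follows: an alignment instance counts as agreed when at least two of its three annotators made the same choice. The annotation instances below are invented examples in the spirit of the BAKING A CAKE scenario.

```python
# Majority-vote agreement: an instance counts as agreed if some choice
# (a target ED, an in-between position, or a no-link) was made by at
# least two of the three annotators. The votes below are invented.
from collections import Counter

annotations = {
    "mix well": ["stir together and mix", "stir together and mix", "no-link"],
    "set timer on oven": ["between: put pan into oven / bake cake"] * 3,
    "follow instructions": ["no-link", "add ingredients to bowl", "bake cake"],
}

def majority_agreement(votes_per_instance):
    agreed = sum(1 for votes in votes_per_instance.values()
                 if Counter(votes).most_common(1)[0][1] >= 2)
    return agreed / len(votes_per_instance)

print(majority_agreement(annotations))  # 2 of the 3 instances have a majority
```

Unlike Kappa, this measure does not require a fixed set of raters per item, which matches the crowdsourcing setup where different workers annotated different ESD pairs.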
                                        One-to-one alignments              One-to-many alignments
Scenario                              tot   maj. agreem.   % agreem.    tot    overlap   avg Dice
baking a cake                        1091       885          0.81        37       28       0.38
borrowing a book from the library    1081       718          0.66        55       30       0.4
flying in an airplane                1081       904          0.84        14       13       0.36
getting a haircut                    1003       810          0.81        13       12       0.44
going grocery shopping                993       856          0.86        28       23       0.41
going on a train                      982       810          0.82        34       32       0.45
planting a tree                      1000       822          0.82        21       18       0.63
repairing a flat bicycle tire         991       708          0.71        29       17       0.2
riding on a bus                      1074       914          0.85        28       25       0.46
taking a bath                        1101       933          0.85        26       24       0.38
Total                               10397      8360          0.80       285      222       0.41

Table 3: Worker agreement for one-to-one and one-to-many alignments per scenario.
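The avg Dice values reported per scenario measure the overlap between two annotators' link sets in one-to-many cases. A hypothetical example with invented link sets:

```python
# Dice coefficient between two annotators' link sets for a one-to-many
# alignment case (the link tuples below are invented examples).
def dice(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

annotator_1 = {("mix well", "stir together and mix"),
               ("mix well", "use fork to breakup clumps")}
annotator_2 = {("mix well", "stir together and mix")}
print(dice(annotator_1, annotator_2))  # 2*1 / (2+1) = 0.666...
```

A Dice of 1 means both annotators drew exactly the same links; values around 0.4, as in the table, indicate agreement on the core link but divergence on its exact span.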
Figure 5: Worker agreement for outliers and stable cases in one-to-one alignments.
These two scenarios are among those with the highest complexity and variability in how the script could be carried out.

The overall average Dice is 0.41, that is, the agreement is quite good on corresponding core events, although there may be disagreement about the precise sequence in the target ESD that corresponds to a given source event. For instance, in Figure 4b workers may agree on the corresponding core event (as mix well is semantically similar to stir together and mix), but they may not agree on the corresponding span, i.e., whether it is most similar only to stir together and mix or whether it also entails use fork to break up clumps.

Gold seeds
Recall that besides the outliers (the difficult cases) and the stable cases (which were used as a baseline), we also included gold seeds for evaluation purposes, in order to compare the workers' annotation against expert annotation. Unsurprisingly, majority agreement in both stable cases and gold seeds was higher than agreement on outliers (87% for gold seeds, 82% for stable cases and 78% for outliers), showing that, while our outlier selection method effectively selects more challenging cases, the quality of the annotation is still very satisfactory. Agreement between the workers' majority vote and the gold annotation was 81%. The cases where the workers did not agree with the gold annotation also illustrate the inherent complexity of the scripts. For example, in BORROWING A BOOK FROM THE LIBRARY, the workers aligned take the book home in the source ESD to the position between leave library and read book in the target ESD, while the experts aligned the same description to leave library. This shows that the workers were not necessarily wrong in all the cases where they did not agree with the gold seeds.

No-links
Interestingly, the workers tended to use no-links for those event descriptions that were not really events (e.g., in BAKING A CAKE: carefully follow instructions step by step) or event descriptions that were unrelated to the given scenario (e.g., in RIDING ON A BUS: sing if desired). The event descriptions annotated as no-links by the workers tend to overlap with those marked by experts as unrelated event descriptions in the gold standard. These cases typically involve event descriptions that are misleading and should not be part of the script. This shows that certain event descriptions are spurious, cases which we cannot expect a clustering algorithm to group in any meaningful way.
4. Comparison with the InScript Stories
As mentioned in the introduction, this work is part of a larger research effort where we seek to provide a solid empirical basis for high-quality script modeling. As part of this larger project, Modi et al. (2016) crowdsourced a corpus of simple scenario-related stories, the InScript corpus

(Figure: per-scenario comparison of DeScript and InScript across the ten scenarios: bath, bicycle, bus, cake, flight, grocery, haircut, library, train, tree.)
7. Bibliographical References

Barr, A. and Feigenbaum, E. A. (1981). Frames and scripts. In The Handbook of Artificial Intelligence, volume 3, pages 216–222. Addison-Wesley, California.

Chambers, N. and Jurafsky, D. (2009). Unsupervised learning of narrative schemas and their participants. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 602–610, Suntec, Singapore.

Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.

Fleiss, J. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382.

Frermann, L., Titov, I., and Pinkal, M. (2014). A hierarchical Bayesian model for unsupervised induction of script knowledge. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 49–57, Gothenburg, Sweden.

Frey, B. J. and Dueck, D. (2007). Clustering by passing messages between data points. Science, 315(5814):972–976.

McCarthy, P. M. and Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2):381–392.

Modi, A., Anikina, T., Ostermann, S., and Pinkal, M. (2016). InScript: Narrative texts annotated with script information. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 16), Portorož, Slovenia.

Raisig, S., Welke, T., Hagendorf, H., and van der Meer, E. (2009). Insights into knowledge representation: The influence of amodal and perceptual variables on event knowledge retrieval from memory. Cognitive Science, 33(7):1252–1266.

Regneri, M., Koller, A., and Pinkal, M. (2010). Learning script knowledge with web experiments. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 979–988, Uppsala, Sweden.

Schank, R. C. and Abelson, R. P. (1977). Scripts, Plans, Goals, and Understanding: An Inquiry into Human Knowledge Structures. Erlbaum, Hillsdale, NJ.

Singh, P., Lin, T., Mueller, E., Lim, G., Perkins, T., and Zhu, W. L. (2002). Open Mind Common Sense: Knowledge acquisition from the general public. In Robert Meersman et al., editors, On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, pages 1223–1237. Springer, Berlin/Heidelberg, Germany.