Leaving No Stone Unturned: Flexible Retrieval of Idiomatic Expressions from a Large Text Corpus
Abstract
:1. Introduction
2. Related Work
3. Methods
3.1. Technology
3.1.1. Elasticsearch
3.1.2. Indexing
3.1.3. Querying
3.2. Implementation
3.2.1. Data
3.2.2. Indexing
‘He had swum against the tide.’
‘But now we know they’re just wolves in sheep’s clothing.’
‘A polite and considered approach, avoiding outright confrontation, will stand you in better stead.’
3.2.3. Querying
Modification
Open Slots
Passivisation
3.2.4. Search Results
4. Results
4.1. Baseline
4.2. Measures
4.3. Testbed
4.4. Results
‘Work out a rough way of what you’re going to give for.’
‘Some may go to out-of-the-way places to sniff which can add to the dangers.’
‘We went out separate ways, nearly all of us to be affected in one way or another by the Bodyline tour.’
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Code Availability
References
- Buerki, A. (How) is formulaic language universal? Insights from Korean, German and English. In Formulaic Language and New Data: Theoretical and Methodological Implications, Formulaic Language; Piirainen, E., Filatkina, N., Stumpf, S., Pfeiffer, C., Eds.; De Gruyter: Berlin, Germany, 2020; Volume 2, pp. 103–134. [Google Scholar]
- Wray, A.; Grace, G.W. The consequences of talking to strangers: Evolutionary corollaries of socio-cultural influences on linguistic form. Lingua 2007, 117, 543–578. [Google Scholar] [CrossRef]
- Wray, A. Formulaic Language and the Lexicon; Cambridge University Press: Cambridge, UK, 2002. [Google Scholar]
- Hanks, P. Lexical Analysis: Norms and Exploitations; MIT Press: Cambridge, MA, USA, 2013. [Google Scholar]
- BBC Radio 4. Spoilers for May 25th–28th 2020. Available online: https://www.facebook.com/notes/archers-appreciation/spoilers-for-may-25th28th-2020/851327765348107/ (accessed on 1 January 2021).
- Moon, R. Fixed Expressions and Idioms in English: A Corpus-Based Approach; OUP Oxford: Oxford, UK, 1998; p. 356. [Google Scholar]
- Haagsma, H.; Nissim, M.; Bos, J. The other side of the coin: Unsupervised disambiguation of potentially idiomatic expressions by contrasting senses. In Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions, Santa Fe, NM, USA, 25–26 August 2018; pp. 178–184. [Google Scholar]
- Cook, P.; Fazly, A.; Stevenson, S. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, Prague, Czech Republic, 28 June 2007; pp. 41–48. [Google Scholar]
- Peng, J.; Feldman, A.; Street, L. Computing linear discriminants for idiomatic sentence detection. Res. Comput. Sci. 2010, 46, 17–28. [Google Scholar]
- Feldman, A.; Peng, J. Automatic detection of idiomatic clauses. In Proceedings of the 14th International Conference on Computational Linguistics and Intelligent Text Processing, Samos, Greece, 24–30 March 2013; pp. 435–446. [Google Scholar]
- Liu, C.; Hwa, R. Heuristically informed unsupervised idiom usage recognition. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October 2018–4 November 2018; pp. 1723–1731. [Google Scholar]
- Sporleder, C.; Li, L. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, Greece, 30 March–3 April 2009; pp. 754–762. [Google Scholar]
- Halliday, M.A.K.; Hasan, R. Cohesion in English; Longman: London, UK, 1976; p. 374. [Google Scholar]
- Miller, G.A. WordNet: A lexical database for English. Commun. ACM 1995, 38, 39–41. [Google Scholar] [CrossRef]
- Cilibrasi, R.L.; Vitanyi, P.M.B. The Google similarity distance. IEEE Trans. Knowl. Data Eng. 2007, 19, 370–383. [Google Scholar] [CrossRef]
- Liu, P.; Qian, K.; Qiu, X.; Huang, X. Idiom-aware compositional distributed semantics. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 1204–1213. [Google Scholar]
- Tai, K.S.; Socher, R.; Manning, C.D. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, 26–31 July 2015; pp. 1556–1566. [Google Scholar]
- Salton, G.D.; Ross, R.J.; Kelleher, J.D. Idiom token classification using sentential distributed semantics. In Proceedings of the 54th Annual Meeting on Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; pp. 194–204. [Google Scholar]
- Williams, L.; Bannister, C.; Arribas-Ayllon, M.; Preece, A.; Spasić, I. The role of idioms in sentiment analysis. Expert Syst. Appl. 2015, 42, 7375–7385. [Google Scholar] [CrossRef] [Green Version]
- Spasić, I.; Williams, L.; Buerki, A. Idiom-based features in sentiment analysis: Cutting the Gordian knot. IEEE Trans. Affect. Comput. 2020, 11, 189–199. [Google Scholar] [CrossRef] [Green Version]
- Flor, M.; Klebanov, B.B. Catching idiomatic expressions in EFL essays. In Proceedings of the Workshop on Figurative Language Processing, New Orleans, LA, USA, 6 June 2018; pp. 34–44. [Google Scholar]
- Pike, R. The text editor sam. Softw. Pract. Exp. 1987, 17, 813–845. [Google Scholar] [CrossRef]
- Laurikari, V. NFAs with tagged transitions, their conversion to deterministic automata and application to regular expressions. In Proceedings of the Seventh International Symposium on String Processing and Information Retrieval, La Curuna, Spain, 27–29 September 2000; pp. 181–187. [Google Scholar]
- Cox, R. Regular Expression Matching can be Simple and Fast (but is Slow in Java, Perl, PHP, Python, Ruby,...). Available online: https://swtch.com/~rsc/regexp/regexp1.html (accessed on 1 January 2021).
- Gormley, C.; Tony, Z. Elasticsearch: The Definitive Guide; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2015. [Google Scholar]
- Białecki, A.; Muir, R.; Ingersoll, G. Apache Lucene 4. In Proceedings of the SIGIR Workshop on Open Source Information Retrieval, Portland, OR, USA, 16 August 2012; pp. 17–24. [Google Scholar]
- Burnard, L. Reference Guide for the British National Corpus (XML Edition). 2007. Available online: http://www.natcorp.ox.ac.uk/docs/URG/ (accessed on 1 January 2021).
- BNC Consortium. The British National Corpus, Version 3 (BNC XML Edition). Available online: http://www.natcorp.ox.ac.uk/ (accessed on 1 January 2021).
- Tumblr. I Don’t Think Uma will Ever Fully Forgive Mal. Available online: https://tmblr.co/ZVqBcbYOic9sWi00 (accessed on 1 January 2021).
- Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python—Analyzing Text with the Natural Language Toolkit; O’Reilly Media: Newton, MA, USA, 2009; p. 504. [Google Scholar]
- Vega-Moreno, R.E. Creativity and Convention: The Pragmatics of Everyday Figurative Speech; John Benjamins Publishing Company: Newton, MA, USA, 2007; p. 264. [Google Scholar]
- Langlotz, A. Idiomatic Creativity: A Cognitive-Linguistic Model of Idiom-Representation and Idiom-Variation in English; John Benjamins: Amsterdam, The Netherlands, 2006; p. 326. [Google Scholar]
- Dutton, K. Exploring the Boundaries of Formulaic Sequences: A Corpus-Based Study of Lexical Substitution and Insertion in Contemporary British English; VDM Verlag: Saarbrücken, Germany, 2009; p. 272. [Google Scholar]
- Riehemann, S.Z. A Constructional Approach to Idioms and Word Formation; Stanford University: Stanford, CA, USA, 2001. [Google Scholar]
- Porter, M.F. An algorithm for suffix stripping. Program 1980, 14, 130–137. [Google Scholar] [CrossRef]
- Beke, K. Learn English Today. Available online: https://www.learn-english-today.com/ (accessed on 1 January 2021).
- Robertson, S.; Zaragoza, H. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr. 2009, 3, 333–389. [Google Scholar] [CrossRef]
- Clarke, S.J.; Willett, P. Estimating the recall performance of Web search engines. Aslib Proc. 1997, 49, 184–189. [Google Scholar] [CrossRef]
- Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46. [Google Scholar] [CrossRef]
- Fleiss, J.L.; Cohen, J.; Everitt, B.S. Large sample standard errors of kappa and weighted kappa. Psychol. Bull. 1969, 72, 323–327. [Google Scholar] [CrossRef] [Green Version]
- Altman, D.G. Practical Statistics for Medical Research; Chapman and Hall/CRC: Boca Raton, FL, USA, 1990; p. 624. [Google Scholar]
- Richards, J.C.; Rogers, T.S. Approaches and Methods in Language Teaching; Cambridge University Press: Cambridge, UK, 2006. [Google Scholar]
- Li, Y.; Yosinski, J.; Clune, J.; Lipson, H.; Hopcroft, J. Convergent learning: Do different neural networks learn the same representations? In Proceedings of the 1st NIPS International Workshop on Feature Extraction: Modern Questions and Challenges, Montréal, QC, Canada, 11–12 December 2015; pp. 196–212. [Google Scholar]
- Price, L.; Thelwall, M. The clustering power of low frequency words in academic Webs. J. Assoc. Inf. Sci. Technol. 2005, 56, 883–888. [Google Scholar] [CrossRef] [Green Version]
- Schönhofen, P.; Benczúr, A.A. Exploiting extremely rare features in text categorization. In Proceedings of the 17th European Conference on Machine Learning, Berlin, Germany, 18–22 September 2006; pp. 759–766. [Google Scholar]
- Fadaee, M.; Bisazza, A.; Monz, C. Examining the tip of the iceberg: A data set for idiom translation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018; pp. 925–929. [Google Scholar]
ID | Variation | Problem | Solution | POS | Example Query |
---|---|---|---|---|---|
1 | modification | insertion | slop | noun + verb | {“query”: “call someone’s bluff”, “slop”: 4} |
2 | open slot | replacement | wildcard + slop | pronoun | {“query”: “call * bluff”, “slop”: 5} |
3 | passivisation with modification | reordering + insertion | reordering + slop | verb | {“query”: “someone’s bluff * call”, “slop”: 5} |
4 | passivisation with an open slot | reordering + replacement | reordering + wildcard + slop | verb + pronoun | {“query”: “* bluff * call”, “slop”: 6} |
ID | Variation | Idiom | Example |
---|---|---|---|
1 | derivation | bury the hatchet | These occasions are marked by much conviviality and the temporary burying of hatchets. |
2 | compounding | grease someone’s palm | Palm-greasing for just about anything from entry to a favoured school to obtaining a bank loan has been considered a fact of life. |
3 | negation | grease someone’s palm | The gondoliers threatened to go on strike and all the floodlights on the night of the show were mysteriously switched off because someone hadn’t had their palm greased. |
4 | modification by a prepositional phrase | open the floodgates | The floodgates to total permissiveness were opened and a society in which “the permissive intellectual’s anything goes” was created. |
5 | distribution over multiple clauses | born with a silver spoon in one’s mouth | She was born, if not with a silver spoon in her mouth, then certainly not one with any chicken soup on it. |
6 | orthographic simulation of stutter | head over heels | H-have to admit it, old thing, I’m h-head over h-heels in love with you. |
Label | Idiomatic | Unclear | Literal | Other | Total |
---|---|---|---|---|---|
Idiomatic | 122 | 0 | 2 | 1 | 125 |
Unclear | 3 | 4 | 0 | 0 | 7 |
Literal | 0 | 0 | 19 | 0 | 19 |
Other | 3 | 0 | 1 | 43 | 47 |
Total | 128 | 4 | 22 | 44 | 198 |
Idiom | TP | FP | P | R | F |
---|---|---|---|---|---|
bitten by the bug | 11 | 0 | 100.00% | 100.00% | 100.00% |
blot one’s copy-book | 5 | 0 | 100.00% | 100.00% | 100.00% |
come up in the world | 8 | 2 | 80.00% | 22.86% | 35.56% |
lose one’s marbles | 16 | 2 | 88.89% | 100.00% | 94.12% |
get the better of someone | 98 | 2 | 98.00% | 51.31% | 67.35% |
give the all clear | 39 | 1 | 97.50% | 66.10% | 78.79% |
go out of one’s way | 1 | 5 | 16.67% | 12.50% | 14.29% |
risk life and limb | 24 | 1 | 96.00% | 96.00% | 96.00% |
sink or swim | 31 | 0 | 100.00% | 96.88% | 98.41% |
turn the tide | 98 | 2 | 98.00% | 73.13% | 83.76% |
Micro-Average | Macro-Average | |||||
---|---|---|---|---|---|---|
Method | P | R | F | P | R | F |
Phrase search | 99.92% | 31.20% | 47.55% | 81.98% | 33.35% | 47.41% |
Flexible search | 95.33% | 82.79% | 88.62% | 95.28% | 85.92% | 90.36% |
Keyword search | 73.06% | 44.63% | 55.42% | 72.54% | 49.44% | 58.80% |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Hughes, C.; Filimonov, M.; Wray, A.; Spasić, I. Leaving No Stone Unturned: Flexible Retrieval of Idiomatic Expressions from a Large Text Corpus. Mach. Learn. Knowl. Extr. 2021, 3, 263-283. https://doi.org/10.3390/make3010013
Hughes C, Filimonov M, Wray A, Spasić I. Leaving No Stone Unturned: Flexible Retrieval of Idiomatic Expressions from a Large Text Corpus. Machine Learning and Knowledge Extraction. 2021; 3(1):263-283. https://doi.org/10.3390/make3010013
Chicago/Turabian StyleHughes, Callum, Maxim Filimonov, Alison Wray, and Irena Spasić. 2021. "Leaving No Stone Unturned: Flexible Retrieval of Idiomatic Expressions from a Large Text Corpus" Machine Learning and Knowledge Extraction 3, no. 1: 263-283. https://doi.org/10.3390/make3010013
APA StyleHughes, C., Filimonov, M., Wray, A., & Spasić, I. (2021). Leaving No Stone Unturned: Flexible Retrieval of Idiomatic Expressions from a Large Text Corpus. Machine Learning and Knowledge Extraction, 3(1), 263-283. https://doi.org/10.3390/make3010013