Precognox at CILC 2013 – The Use of Corpora in Natural Language Processing

Our computational linguist Zoltan Varju presented his thoughts on the use of corpora in natural language processing at the V International Conference on Corpus Linguistics (CILC 2013). Here is the abstract:

Some researchers suggest[1] that even less sophisticated algorithms give better results when they use large, web-scale corpora. Corpus-based language models bring empirical evidence into linguistic inquiry, and statistical methods have become the state-of-the-art techniques in natural language processing and linguistics[2].

On the other hand, we have to face methodological questions when we use web corpora. In most cases, the industry unconsciously relies on Leech's notion of representativeness[3] and aims to use a corpus that is big enough to generalize to the whole language. However, usage determines sampling, and we cannot generalize outside the domain of our data. One of the most striking examples of this vicious circle is named entity recognition, which is a notoriously domain-specific task.

Although we aim for fully automatic solutions, we are very far from such applications. The human factor in processing corpora is still important, and we need more elaborate methods. One promising direction is crowdsourcing, which reduces the time and cost of annotation[4], but the cost of expertise in data curation cannot be avoided. Titles like [5] show that the industry is interested in standard practices and needs guidance to overcome ad hoc, domain-specific solutions.

[1] Alon Halevy – Peter Norvig – Fernando Pereira: The Unreasonable Effectiveness of Data, IEEE Intelligent Systems, March/April 2009, p. 8-12
[2] Peter Norvig: Colorless green ideas learn furiously: Chomsky and the two cultures of statistical learning, Significance, August 2012, Vol. 9, Issue 4, p. 30-33
[3] Geoffrey Leech: The state of the art in corpus linguistics, in K. Aijmer – B. Altenberg (eds): English Corpus Linguistics: Studies in Honour of Jan Svartvik, Longman, London, 1991, p. 8-29
[4] Chris Callison-Burch – Mark Dredze (eds): Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, Association for Computational Linguistics, 2010
[5] James Pustejovsky – Amber Stubbs: Natural Language Annotation for Machine Learning, O’Reilly Media, 2012