Jordi Atserias Batalla

Publication Profiles:
VERTa: a linguistic approach to automatic machine translation evaluation Elisabet Comelles and Jordi Atserias Machine translation (MT) is directly linked to its evaluation in order to both compare different MT system outputs and analyse system errors so that they can be addressed and corrected. As a consequence, MT evaluation has become increasingly important and popular in the last decade, leading to the development of MT evaluation metrics aiming at automatically assessing MT output. Most of these metrics use reference translations in order to compare system output, and the most well-known and widely spread work at lexical level. In this study we describe and present a linguistically-motivated metric, VERTa, which aims at using and combining a wide variety of linguistic features at lexical, morphological, syntactic and semantic level. Before designing and developing VERTa a qualitative linguistic analysis of data was performed so as to identify the linguistic phenomena that an MT metric must consider (Comelles et al. 2017). In the present study we introduce VERTa’s design and architecture and we report the experiments performed in order to develop the metric and to check the suitability and interaction of the linguistic information used. The experiments carried out go beyond traditional correlation scores and step towards a more qualitative approach based on linguistic analysis. Finally, in order to check the validity of the metric, an evaluation has been conducted comparing the metric’s performance to that of other well-known state-of-the-art MT metrics. Lang Resources & Evaluation 53, 57–86 (2019) 10.1007/s10579-018-9430-2 journal verta 2019 PDF
Through the eyes of verta Elisabet Comelles and Jordi Atserias This paper describes a practical demo of VERTa for Spanish. VERTa is an MT evaluation metric that combines linguistic features at different levels. VERTa has been developed for English and Spanish but can be easily adapted to other languages. VERTa can be used to evaluate adequacy, fluency and ranking of sentences. In this paper, VERTa’s modules are described briefly, as well as its graphical interface which provides information on VERTa’s performance and possible MT errors. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 366–372, Lisbon, Portugal. Association for Computational Linguistics.016 journal paper sepln verta 2016 machine translation machine translation evaluation PDF
VERTa: a Linguistically-motivated Metric at the WMT15 Metrics Task Elisabet Comelles and Jordi Atserias This paper describes VERTa’s submission to the 2015 EMNLP Workshop on Statistical Machine Translation. VERTa is a linguistically-motivated metric that combines linguistic features at different levels. In this paper, VERTa is described briefly, as well as the three versions submitted to the workshop: VERTa-70Adeq30Flu, VERTa-EQ and VERTa-W. Finally, the experiments conducted with the WMT14 data are reported and some conclusions are drawn. in Proceedings of the tenth workshop on statistical machine translation, wmt@EMNLP 2015, 17-18 september 2015, lisbon, portugal, 2015, pp. 366–372 10.18653/v1/W15-3045 paper 2015 verta wmt evaluation methodologies machine translation competition PDF
VERTa participation in the WMT14 Metrics Task Elisabet Comelles and Jordi Atserias We present VERTa, a linguistically-motivated metric that combines linguistic features at different levels. We provide the linguistic motivation on which the metric is based, as well as describe the different modules in VERTa and how they are combined. Finally, we describe the two versions of VERTa, VERTa-EQ and VERTa-W, sent to WMT14 and report results obtained in the experiments conducted with the WMT12 and WMT13 data into English. in Proceedings of the ninth workshop on statistical machine translation, wmt@ACL 2014, june 26-27, 2014, baltimore, maryland, USA, pp. 368–375 10.3115/v1/w14-3347 paper 2014 verta wmt evaluation methodologies machine translation competition PDF
VERTa: Facing a Multilingual Experience of a Linguistically-based MT Evaluation Elisabet Comelles, Jordi Atserias, Victoria Arranz, Irene Castellon and Jordi Sesé There are several MT metrics used to evaluate translation into Spanish, although most of them use partial or little linguistic information. In this paper we present the multilingual capability of VERTa, an automatic MT metric that combines linguistic information at lexical, morphological, syntactic and semantic level. In the experiments conducted we aim at identifying those linguistic features that prove the most effective to evaluate adequacy in Spanish segments. This linguistic information is tested both as independent modules (to observe what each type of feature provides) and in a combinatory fastion (where different kinds of information interact with each other). This allows us to extract the optimal combination. In addition we compare these linguistic features to those used in previous versions of VERTa aimed at evaluating adequacy for English segments. Finally, experiments show that VERTa can be easily adapted to other languages than English and that its collaborative approach correlates better with human judgements on adequacy than other well-known metrics. in Proceedings of the ninth international conference on language resources and evaluation, LREC 2014, reykjavik, iceland, may 26-31, 2014, 2014, pp. 2701–2707 paper 2014 verta lrec evaluation methodologies machine translation PDF
Using wikipedia for cross-language named entity recognition E. R. Fernandes, U. Brefeld, R. Blanco, and J. Atserias, Named entity recognition and classification (NERC) is fundamental for natural language processing tasks such as information extraction, question answering, and topic detection. State-of-the-art NERC systems are based on supervised machine learning and hence need to be trained on (manually) annotated corpora. However, annotated corpora hardly exist for non-standard languages and labeling additional data manually is tedious and costly. In this article, we present a novel method to automatically generate (partially) annotated corpora for NERC by exploiting the link structure of Wikipedia. Firstly, Wikipedia entries in the source language are labeled with the NERC tag set. Secondly, Wikipedia language links are exploited to propagate the annotations in the target language. Finally, mentions of the labeled entities in the target language are annotated with the respective tags. The procedure results in a partially annotated corpus that is likely to contain unannotated entities. To learn from such partially annotated data, we devise two simple extensions of hidden Markov models and structural perceptrons. Empirically, we observe that using the automatically generated data leads to more accurate prediction models than off-the-shelf NERC methods. We demonstrate that the novel extensions of HMMs and perceptrons effectively exploit the partially annotated data and outperforms their baseline counterparts in all settings. 5th international workshop on mining ubiquitous and social environments, muse 2014. In Big data analytics in the social and ubiquitous context - 5th international worksh 5th international workshop on mining ubiquitous and social environments, MUSE 2014, and first international workshop on machine learning for urban sensor data, senseml 2014, revised selected papers, 2014, vol. 9546, pp. 1–25 10.1007/978-3-319-29009-6_1 paper 2014 wikipedia entity recognition hide markov model hmm conditional random field crf named entity recognition and classification nerc PDF
FBM: Combining lexicon-based ML and heuristics for Social Media Polarities Carlos Rodríguez-Penagos, Jordi Atserias Batalla, Joan Codina-Filbà, David García-Narbona, Jens Grivolla, Patrik Lambert and Roser Saurí This paper describes the system implemented by Fundació Barcelona Media (FBM) for classifying the polarity of opinion expressions in tweets and SMSs, and which is supported by a UIMA pipeline for rich linguistic and sentiment annotations. FBM participated in the SEMEVAL 2013 Task 2 on polarity classification. It ranked 5th in Task A (constrained track) using an ensemble system combining ML algorithms with dictionary-based heuristics, and 7th (Task B, constrained) using an SVM classifier with features derived from the linguistic annotations and some heuristics. in Proceedings of the 7th international workshop on semantic evaluation, semeval@NAACL-hlt 2013, atlanta, georgia, usa, june 14-15, 2013, 2013, pp. 483–489 paper 2013 semeval sentiment analysis polarity of opinion tweets SMS PDF
Spell Checking in Spanish: The Case of Diacritic Accents Jordi Atserias, Maria Fuentes, Rogelio Nazar and Irene Renau This article presents the problem of diacritic restoration (or diacritization) in the context of spell-checking, with the focus on an orthographically rich language such as Spanish. We argue that despite the large volume of work published on the topic of diacritization, currently available spell-checking tools have still not found a proper solution to the problem in those cases where both forms of a word are listed in the checker’s dictionary. This is the case, for instance, when a word form exists with and without diacritics, such as continuo ‘continuous’ and continuó ‘he/she/it continued’, or when different diacritics make other word distinctions, as in continúo ‘I continue’. We propose a very simple solution based on a word bigram model derived from correctly typed Spanish texts and evaluate the ability of this model to restore diacritics in artificial as well as real errors. The case of diacritics is only meant to be an example of the possible applications for this idea, yet we believe that the same method could be applied to other kinds of orthographic or even grammatical errors. Moreover, given that no explicit linguistic knowledge is required, the proposed model can be used with other languages provided that a large normative corpus is available. in Proceedings of the eighth international conference on language resources and evaluation, LREC 2012, istanbul, turkey, may 23-25, 2012, 2012, pp. 737–742 paper 2012 lrec spell checking tweets PDF
FBM-Yahoo! at RepLab 2012 Jose M. Chenlo, Jordi Atserias, Carlos Rodriguez and Roi Blanco This paper describes FBM-Yahoo!’s participation in the profiling task of RepLab 2012, which aims at determining whether a given tweet is related to a specific company and, in if this being the case, whether it contains a positive or negative statement related to the company’s reputation or not. We addressed both problems (ambiguity and polarity reputation) using Support Vector Machines (SVM) classifiers and lexicon-based techniques, building automatically company profiles and bootstrapping background data. Concretely, for the ambiguity task we employed a linear SVM classifier with a token-based representation of relevant and irrelevant information extracted from the tweets and Freebase resources. With respect to polarity classification, we combined SVM lexicon-based approaches with bootstrapping in order to determine the final polarity label of a tweet in CLEF 2012 evaluation labs and workshop, online working notes, rome, italy, september 17-20, 2012, 2012, vol. 1178 paper 2012 replab clef tweets competition PDF
VERTa: Linguistic features in MT evaluation E. Comelles, J. Atserias, V. Arranz, and I. Castellón In the last decades, a wide range of automatic metrics that use linguistic knowledge has been developed. Some of them are based on lexical information, such as METEOR; others rely on the use of syntax, either using constituent or dependency analysis; and others use semantic information, such as Named Entities and semantic roles. All these metrics work at a specific linguistic level, but some researchers have tried to combine linguistic information, either by combining several metrics following a machine-learning approach or focusing on the combination of a wide variety of metrics in a simple and straightforward way. However, little research has been conducted on how to combine linguistic features from a linguistic point of view. In this paper we present VERTa, a metric which aims at using and combining a wide variety of linguistic features at lexical, morphological, syntactic and semantic level. We provide a description of the metric and report some preliminary experiments which will help us to discuss the use and combination of certain linguistic features in order to improve the metric performance in Proceedings of the eighth international conference on language resources and evaluation, LREC 2012, istanbul, turkey, may 23-25, 2012 paper 2012 machine translation evaluation methodologies PDF
Active Learning for Building a Corpus of Questions for Parsing J. Atserias, G. Attardi, M. Simi, and H. Zaragoza This paper describes how we built a dependency Treebank for questions. The questions for the Treebank were drawn from questions from the TREC 10 QA task and from Yahoo! Answers. Among the uses for the corpus is to train a dependency parser achieving good accuracy on parsing questions without hurting its overall accuracy. We also explore active learning techniques to determine the suitable size for a corpus of questions in order to achieve adequate accuracy while minimizing the annotation efforts. in Proceedings of the international conference on language resources and evaluation, LREC 2010, 17-23 may 2010, valletta, malta, 2010, vol. 800, pp. 9–080 paper 2010 corpus parsing question answering PDF Slides
Automatic Annotation of the Catalan Wikipedia: Exploring the Semantic Space via multiple NERC systems Jordi Atserias, Judith Domingo, Carlos Rodriguez, Teresa Suñol This paper presents WikiNer, a snapshot of the Catalan Wikipedia processed with different NLP tools (POS tagger, NERC, dependency parsers). The article focuses on the analysis of different NERC annotations using 3 taggers: JNET, YamCha and SST. Although Wikipedia text (specially in tables, lists, references) differs significantly in distributional properties from the corpora used to train the taggers, we believe that results of automatically annotating the semantic space of the Catalan Wikipedia point to the quick availability of a resource containing massive text annotated with a degree of reliability that is enough for some research tasks as well as for applications, such as simple Q&A, ontology enrichment and semantic search Proces. del Leng. Natural, vol. 45, pp. 169–173, 2010 journal paper 2010 sepln wikipedia nerc catalan PDF
Annotated Search and Element Retrieval Hugo Zaragoza, Michael Matthews, Roi Blanco, and Jordi Atserias Despite the great interest in different forms of textual annotation (named entity extraction, semantic tagging, syntactic and semantic parsing, etc.), there is still no consensus about which search tasks can be improved with such annotations, and what search algorithms are required to implement e cient engines to solve these tasks. We de ne formally two retrieval tasks in annotated collections: annotated retrieval and element retrieval. We discuss their differences and describe effcient indexing structures, and how they can be implemented in Lucene and MG4J, two open source retrieval engines. Finally, we give a technical overview of two element retrieval use cases. in Proceedings of first international workshop on living web, collocated with the 8th international semantic web conference (iswc-2009), washington, dc, usa, october 26, 2009, vol. 515 paper 2009 PDF
Semantically Annotated Snapshot of the English Wikipedia J. Atserias, H. Zaragoza, M. Ciaramita, and G. Attardi This paper describes SW1, the first version of a semantically annotated snapshot of the English Wikipedia. In recent years Wikipedia has become a valuable resource for both the Natural Language Processing (NLP) community and the Information Retrieval (IR) community. Although NLP technology for processing Wikipedia already exists, not all researchers and developers have the computational resources to process such a volume of information. Moreover, the use of different versions of Wikipedia processed differently might make it difficult to compare results. The aim of this work is to provide easy access to syntactic and semantic annotations for researchers of both NLP and IR communities by building a reference corpus to homogenize experiments and make results comparable. These resources, a semantically annotated corpus and a ’entity containment’ graph. paper 2008 lrec corpus information extraction information retrieval acquisition machine learning PDF Slides
Complete and Consistent Annotation of WordNet using the Top Concept Ontology Javier Álvez, Jordi Atserias, Jordi Carrera, Salvador Climent, Egoitz Laparra, Antoni Oliver and German Rigau This paper presents the complete and consistent ontological annotation of the nominal part of WordNet. The annotation has been carried out using the semantic features defined in the EuroWordNet Top Concept Ontology and made available to the NLP community. Up to now only an initial core set of 1,024 synsets, the so-called Base Concepts, was ontologized in such a way. The work has been achieved by following a methodology based on an iterative and incremental expansion of the initial labeling through the hierarchy while setting inheritance blockage points. Since this labeling has been set on the EuroWordNet’s Interlingual Index (ILI), it can be also used to populate any other wordnet linked to it through a simple porting process. This feature-annotated WordNet is intended to be useful for a large number of semantic NLP tasks and for testing for the first time componential analysis on real environments. Moreover, the quantitative analysis of the work shows that more than 40% of the nominal part of WordNet is involved in structure errors or inadequacies. in Proceedings of the international conference on language resources and evaluation, LREC 2008, 26 may - 1 june 2008, marrakech, morocco, 2008 paper 2008 lrec ontologies semantics lexicon lexical database wordnet PDF Slides
Learning to tag and tagging to learn: A case study on wikipedia Peter Mika, Massimiliano Ciaramita, Hugo Zaragoza and Jordi Atserias The problem of semantically annotating Wikipedia inspires a novel method for dealing with domain and task adaptation of semantic taggers in cases where parallel text and metadata are available. IEEE Intelligent Systems, vol. 23, no. 5, pp. 26–33, 2008. 10.1109/MIS.2008.85 journal 2008 PDF
PoS Tagging with a Named Entity Tagger M. Ciaramita and J. Atserias in Proceedings of the final evalita 2007 workshop, 2007. paper 2007 competition PDF
Named Entity Tagging with a PoS Tagger M. Ciaramita and J. Atserias in Proceedings of the final evalita 2007 workshop, 2007. paper 2007 competition PDF
PoS Tagging with a Named Entity Tagger M. Ciaramita and J. Atserias Intelligenza Artificiale, special issue (EVALITA 2007), vol. 2, 2007. journal 2007 competition
Named Entity Tagging with a PoS Tagger M. Ciaramita and J. Atserias Intelligenza Artificiale, special issue (EVALITA 2007), vol. 2, 2007. journal 2007 competition
World knowledge in broad-coverage information filtering B. A. Hagedorn, M. Ciaramita, and J. Atserias in SIGIR 2007: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, amsterdam, the netherlands, july 23-27, 2007, 2007, pp. 801–802 10.1145/1277741.1277916 paper 2007
Ranking very many typed entities on wikipedia Hugo Zaragoza, Henning Rode, Peter Mika, Jordi Atserias, Massimiliano Ciaramita, Giuseppe Attardi We discuss the problem of ranking very many entities of different types. In particular we deal with a heterogeneous set of types, some being very generic and some very specific. We discuss two approaches for this problem: i) exploiting the entity containment graph and ii) using a Web search engine to compute entity relevance. We evaluate these approaches on the real task of ranking Wikipedia entities typed with a state-of-the-art named-entity tagger. Results show that both approaches can greatly increase the performance of methods based only on passage retrieval. 10.1145/1321440.1321599 paper 2007 PDF
Towards Robustness in Natural Language Understanding Jordi Atserias Most of the different tasks included in Natural Language Processing (nlp) (such as, Word Sense Disambiguation, Information Retrieval, Information Extraction, Question Answering, Information Filtering, Natural Language Interfaces, Story Understanding or Machine Translation) apply different levels of Natural Language Understanding (nlu). This thesis explores a new integrated architecture for robust nlu, exploiting constraint-based optimization techniques. The goal of this work is towards robust and flexible architectures able to deal with the complexity of advanced nlp. In particular, we present a novel architecture (pardon), orthogonal to the traditional nlp task decomposition, which applies any kind of knowledge (syntactic, semantic, linguistic, statistical) at the earliest opportunity while retaining an independent representation of the different kinds of knowledge. The different architectures proposed for nlu can be classified based on two main dimensions, namely, the level of integration of their processes and the level of integration of their data. An easier modularization aimed at focusing on a particular nlp task and competitions (e.g. MUC, TREC, etc) have lead most of the researchers to adopt a pipelined or stratified architecture. However, this architecture shows several drawbacks which has made us consider the use of integrated and interactive approaches. In order to implement such approaches, we will also introduce the Consistent Labeling Problems (clps), a specific case of Constraint Satisfaction Problems that can be solved eficiently by a set of iterative algorithms (e.g. relaxation labeling). Constraints allow us to integrate both processes and knowledge in the same framework. On the one hand, many forms of ambiguity can be represented in a compact and elegant manner, and processed eficiently by means of constraints. On the other hand, many nlp processes (e.g., many wsd techniques) could also be represented as constraints. Inside the pardon architecture, an object uses its models to combine itself with other objects. During this combination, some of its attribute values are determined (in a similar way to Hearst’s Polaroid Words [Hirst, 1987]). Roughly speaking, pardon combines objects from one level in order to build the objects corresponding to the next level of the task under consideration. This combination is carried out by using lexicalized models. That is, these models must be anchored-in/trigged by a first-level object. pardon represents the relationships between objects in a dependency-like style, with models and roles. In order to avoid the combinatorial explosion of possible object combinations, this framework is formalized as a Consistent Labeling Problem (clp). Thus, it can be solved using optimization methods (e.g. the relaxation labeling algorithm) to find the most consistent solution. pardon aims to give a general framework, that is multilingual and open domain, in which different nlp tasks can be easily formalized. These different tasks can be tested separately or carried out simultaneously following an integrated approach. Pursuing this goal, we have also integrated several resources in a multilingual knowledge base, named Multilingual Central Repository (mcr). mcr has been built aroundWordNet, using the EuroWordNet architecture. This multilingual repository integrates different resources, ontologies (sumo, Top Concept Ontology), thematic classifications (Domains), local wordnets of have diferent languages, and so on. The new architecture proposed by pardon has been successfully applied to two different nlu tasks involved in Semantic Interpretation, namely Semantic Role Labeling (srl) and Word Sense Disambiguation (wsd). Usually, Word Sense Disambiguation and Semantic Role Labeling are considered separately although they are strongly related. wsd can improve results in srl (as different senses have different syntactic behaviours, specially verbs) and vice-versa (e.g. using verbal preferences for wsd)." tesis 2006 pardon semantic role labeling consistent labeling problem relaxation labeling word sense disambiguation srl wsd PDF
FreeLing 1.3: Syntactic and semantic services in an open-source NLP library J. Atserias, B. Casas, E. Comelles, M. González, L. Padró, and M. Padró This paper describes version 1.3 of the FreeLing suite of NLP tools. FreeLing was first released in February 2004 providing morphological analysis and PoS tagging for Catalan, Spanish, and English. From then on, the package has been improved and enlarged to cover more languages (i.e. Italian and Galician) and offer more services: Named entity recognition and classification, chunking, dependency parsing, and WordNet based semantic annotation. FreeLing is not conceived as end-user oriented tool, but as library on top of which powerful NLP applications can be developed. Nevertheless, sample interface programs are provided, which can be straightforwardly used as fast, flexible, and efficient corpus processing tools. A remarkable feature of FreeLing is that it is distributed under a free-software LGPL license, thus enabling any developer to adapt the package to his needs in order to get the most suitable behaviour for the application being developed. in Proceedings of the fifth international conference on language resources and evaluation, LREC 2006, genoa, italy, may 22-28, 2006, 2006, pp. 2281–2286 paper 2006 lrec freeling nlp tools corpus analysis tools morphological analysis pos tagging nerc named entity recognition and classification semantic annotation free software. PDF
Multiwords and Word Sense Disambiguation Victoria Arranz, Jordi Atserias and Mauro Castillo This paper studies the impact of multiword expressions on Word Sense Disambiguation (WSD). Several identification strategies of the multiwords in WordNet2.0 are tested in a real Senseval-3 task: the disambiguation of WordNet glosses. Although we have focused on Word Sense Disambiguation, the same techniques could be applied in more complex tasks, such as Information Retrieval or Question Answering in Computational linguistics and intelligent text processing, 6th international conference, cicling 2005, mexico city, mexico, february 13-19, 2005, proceedings, 2005, vol. 3406, pp. 250–262 10.1007/978-3-540-30586-6_28 paper 2005 wsd multiwords word sense disambiguation wordnet PDF
Un Enfoque Integrado para la Desambiguación Jordi Atserias This paper presents an extension for WSD of an integrated arquitecture disigned for Semantic Parsing. In the proposed framework, both tasks could be adressed simultaneously, colaborating between them. The feasibility and robustness of the proposed arquitecture have been proved against a well-defined task on WSD (the SENSEVAL-II English Lexical Sample) using automatically acquired models. in XXI congreso de la sociedad española para el procesamiento del lenguaje natural (sepln’05), 2005, pp. 179–186. paper 2005 sepln wsd semantic parsing PDF
Artificial intelligence and computer science Jordi Atserias S. Shannon, Ed. Nova Science Publisher Inc., 2005, pp. 177–196. book 2005
TXALA un analizador libre de dependencias para el castellano Jordi Atserias Batalla, Elisabet Comelles Pujadas and Aingeru Mayor In this demo we present the first version of Txala, a dependency parser for Spanish developed under LGPL license. This parser is framed in the development of a free-software platform for Machine Translation. Due to the lack of this kind of syntactic parsers for Spanish, this tool is essential for the development of NLP in Spanish. in XXI congreso de la sociedad española para el procesamiento del lenguaje natural (sepln'05), 2005, pp. 455–456. journal paper 2005 sepln syntax parsing dependecy parser nlp tools PDF
An Integrated Approach to Word Sense Disambiguation J. Atserias, L. Padró, and G. Rigau in Recent advances in natural language processing (ranlp’05), 2005, pp. 82–88. paper 2005
A Proposal for a Shallow Ontologization of Wordnet Salvador Climent, Jordi Atserias Batalla, Joaquim Moré López and German Rigau Claramunt This paper presents the work carried out towards the so-called shallow ontologization of WordNet, which is argued to be a way to overcome most of the many structural problems of the widely used lexical knowledge base. The result shall be a multilingual resource more suitable for large-scale semantic processing. Proces. del Leng. Natural, vol. 35, pp. 161–167, 2005 journal paper sepln 2005 wordnet ontologies PDF
The MEANING Multilingual Central Repository Jordi Atserias, Luís Villarejo, German Rigau, Eneko Agirre, John Carroll, Bernardo Magnini and Piek Vossen This paper describes the first version of the Multilingual Central Repository, a lexical knowledge base developed in the framework of the MEANING project. Currently the MCR integrates into the EuroWordNet framework five local wordnets (including four versions of the English WordNet from Princeton), an upgraded version of the EuroWordNet Top Concept ontology, the MultiWordNet Domains, the Suggested Upper Merged Ontology (SUMO) and hundreds of thousand of new semantic relations and properties automatically acquired from corpora. We believe that the resulting MCR will be the largest and richest Multilingual Lexical Knowledge Base in existence. in 2nd International global wordnet conference (gwc'04), 2004, pp. 23-30. paper 2004
Cross-Language Acquisition of Semantic Models for Verbal Predicates Jordi Atserias, Bernardo Magnini, Octavian Popescu, Eneko Agirre, Aitziber Atutxa, German Rigau, John Carroll, Rob Koeling This paper presents a semantic-driven methodology for the automatic acquisition of verbal models. Our approach relies strongly on the semantic generalizations allowed by already existing resources (e.g. Domain labels, Named Entity categories, concepts in the SUMO ontology, etc). Several experiments have been carried out using comparable corpora in four languages (Italian, Spanish, Basque and English) and two domains (FINANCE and SPORT) showing that the semantic patterns acquired can be general enough to be ported from one language to the other language. in 4th International conference on language resources and evaluation (lrec'04), 2004, pp. 33-36. paper 2004 lrec multilingual knowledge acquisition PDF
Towards the MEANING Top Ontology: Sources of Ontological Meaning Jordi Atserias, Salvador Climent, German Rigau This paper describes the initial research steps towards the Top Ontology for the Multilingual Central Repository (Mcr) built in the Meaning project. The current version of the Mcr integrates five local wordnets plus four versions of Princeton’s English WordNet, three ontologies and hundreds of thousands of new semantic relations and properties automatically acquired from corpora. In order to maintain compatibility among all these heterogeneous knowledge resources, it is fundamental to have a robust and advanced ontological support. This paper studies the mapping of main Sources of Ontological Meaning onto the wordnets and, in particular, the current work in mapping the EuroWordNet Top Concept Ontology. in 4th International conference on language resources and evaluation (lrec’04), 2004, pp. 11–14. paper 2004 lrec wordnet eurowordnet ontologies PDF
Spanish WordNet 1.6: Porting the Spanish Wordnet across Princeton versions Jordi Atserias, Luís Villarejo, German Rigau This paper describes the new Spanish Wordnet aligned to Princeton WordNet1.6 and the analysis of the transformation from the previous version aligned to Princeton WordNet1.5. Although a mapping technology exists, to our knowledge it is the first time a whole local wordnet has been ported to a newer release of the Princeton WordNet. in 4th International conference on language resources and evaluation (lrec’04), 2004, pp. 161–164. paper 2004 lrec wordnet eurowordnet PDF
The TALP Systems for Disambiguating WordNet Glosses Mauro Castillo, Francis Real, Jordi Atserias and German Rigau This paper presents a summary report on the empirical results obtained on the SENSEVAL-3 task 12 “Word-Sense Disambiguation of WordNet Glosses”. Our method combines a set of knowledge-based heuristics integrating several information resources and techniques. From the ten systems presented at the taks, our systems obtained the first and third positions. paper 2004 competition senseval wordnet wsd PDF
Automatic Acquisition of Sense Examples Using ExRetriever M. Cuadros, J. Atserias, M. Castillo, and G. Rigau A current research line for word sense disambiguation (WSD) focuses on the use of supervised machine learning techniques. One of the drawbacks of using such techniques is that previously sense annotated data is required. This paper presents ExRetriever, a new software tool for automatically acquiring large sets of sense tagged examples from large collections of text and the Web. ExRetriever exploits the knowledge contained in large-scale knowledge bases (e.g., WordNet) to build complex queries, each of them characterising particular senses of a word. These examples can be used as training instances for supervised WSD algorithms. in IBERAMIA workshop on lexical resources and the web for word sense disambiguation, 2004, pp. 97–104. paper 2004 wsd PDF
Automatic Acquisition of Sense Examples Using ExRetriever J. Fernández, M. Castillo, G. Rigau, J. Atserias, and J. Turmo A current research line for word sense disambiguation (WSD) focuses on the use of supervised machine learning techniques. One of the drawbacks of using such techniques is that previously sense annotated data is required. This paper presents ExRetriever, a new software tool for automatically acquiring large sets of sense tagged examples from large collections of text and the Web. ExRetriever exploits the knowledge contained in large-scale knowledge bases (e.g., WordNet) to build complex queries, each of them characterising particular senses of a word. These examples can be used as training instances for supervised WSD algorithms. in Proceedings of the fourth international conference on language resources and evaluation, LREC 2004, may 26-28, lisbon, portugal, 2004 paper 2004 acquisition WordNet WSD PDF
Exploring large-scale Acquisition of Multilingual Semantic Models for Predicates Jordi Atserias, Mauro Castillo, Francis Real, Horacio Rodríguez and Germán Rigau paper journal sepln 2003 PDF
Starting up the Multilingual Central Repository Jordi Atserias, German Rigau and Luís Villarejo paper journal sepln 2003 PDF
Integrating and porting Knowleges across Languages Jordi Atserias, German Rigau and Luís Villarejo in Recent advances in natural language processing (ranlp'03), 2003, pp. 31-37. ISBN 954-90906-6-3 paper 2003 PDF
Starting up the Multilingual Central Repository Jordi Atserias, German Rigau and Luís Villarejo journal paper sepln 2003 PDF
The MEANING project Germán Rigau, Eneko Agirre and Jordi Atserias journal paper sepln 2003 PDF
First Release of the Multilingual Central Repository of MEANING Luís Villarejo, Jordi Atserias, Gerard Escudero, and German Rigau demo sepln 2003 PDF
Integrating Multiple Knowledge Sources for Robust Semantic Parsing Jordi Atserias, Lluís Padró and Germán Rigau This work explores a new robust approach for Semantic Parsing of unrestricted texts. Our approach considers Semantic Parsing as a Consistent Labelling Problem (clp), allowing the integration of several knowledge types (syntactic and semantic) obtained from different sources (linguistic and statistic). The current implementation obtains 95% accuracy in model identification and 72% in case-role filling paper 2001 PDF
Combining Supervised and Unsupervised Lexical Knowledge Methods for Word Sense Disambiguation E. Agirre, G. Rigau, L. Padró, and J. Atserias Comput. Humanit., vol. 34, nos. 1-2, pp. 103-108, 2000 10.1023/A:1002486301447 paper 2000 PDF
Semantic Analysis based on Verbal Subcategorization J. Atserias, I. Castellón, M. Civit, and G. Rigau in Conference on intelligent text processing and computational linguistics (cicling'00), 2000, pp. 330-340. paper 2000 PDF
Using Diathesis for Semantic Parsing J. Atserias, I. Castellón, M. Civit, and G. Rigau in Venecia per il trattamento automatico delle lingue (vextal'99), 1999, pp. 385-392. paper 1999 PDF
Morphosyntactic Analysis and Parsing of Unrestricted Spanish Text Jordi Atserias i Batalla, Josep Carmona Vargas, Irene Castellón Masalles, Sergi Cervell, Montserrat Civit Torruella, Lluís Màrquez, María Antonia Martí Antonín, Lluís Padró Cirera, Roberto Placer, Horacio Rodríguez Hontoria, Mariona Taulé Delor, Jordi Turmo This online demonstration is about an environment for massive processing of unrestricted Spanish text. The system consists of three stages: morphological analysis, POS disambiguation and parsing. The output of each can be pipelined into the next. The first two phases are described in (Carmona et al., 1998) and the third is described in (Atserias et al., 1998), both published in this conference. The execution may be performed inside the GATE environment, which enables visualization and analysis of intermediate results, or either in background, if higher eeciency is required for massive text processing. paper 1998 demo gate lrec PDF
Syntactic Parsing of Unrestricted Spanish Text I. Castellón, M. Civit, and J. Atserias This research focusses on the syntactical parsing of morphologycal tagged corpora. A proposal for a corpus oriented Spanish grammar is presented in this document. This work has been developed in the framework of the ITEM project and its main goal is to provide multilingual background for information extraction and retrieval tasks. The main goal of Tacat analyser is to provide a way of obtaining large amounts of bracketed and parsed corpora, both general land specific domain. Tacat uses context free grammars and has as input following categories of Parole specification.The incremental methodology that we use allows us to recognise different levels of complexity in the analysis and to produce compatible outputs of all the grammars in 1st International conference on language resources and evaluation (lrec'98), 1998, pp. 603-609. paper 1998 PDF
Combining Multiple Methods for the Automatic Construction of Multilingual WordNets Jordi Atserias, Salvador Climent, Xavier Farreres, German Rigau and Horacio Rodríguez This paper explores the automatic construction of a multilingual Lexical Knowledge Base from preexisting lexical resources. First, a set of automatic and complementary techniques for linking Spanish words collected from monolingual and bilingual MRDs to English WordNet synsets are described. Second, we show how resulting data provided by each method is then combined to produce a preliminary version of a Spanish WordNet with an accuracy over 85%. The application of these combinations results on an increment of the extracted connexions of a 40% without losing accuracy. Both coarse-grained (class level) and fine-grained (synset assignment level) confidence ratios are used and evaluated. Finally, the results for the whole process are presented. in Recent advances in natural language (ranlp'97), 1997, pp. 143-149 10.1075/cilt.189.32ats paper 1997 PDF
Combining Unsupervised Lexical Knowledge Methods for Word Sense Disambiguation German Rigau, Jordi Atserias, and Eneko Agirre This paper presents a method to combine a set of unsupervised algorithms that can accurately disambiguate word senses in a large, completely untagged corpus. Although most of the techniques for word sense resolution have been presented as stand-alone, it is our belief that full-fledged lexical ambiguity resolution should combine several information sources and techniques. The set of techniques have been applied in a combined way to disambiguate the genus terms of two machine-readable dictionaries (MRD), enabling us to construct complete taxonomies for Spanish and French. Tested accuracy is above 80% overall and 95% for two-way ambiguous genus terms, showing that taxonomy building is not limited to structured dictionaries such as LDOCE. in Joint 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics (ACL/EACL’97), 1997, pp. 48–55 10.3115/976909.979624 paper 1997 acl wsd PDF

VERTa: a linguistic approach to automatic machine translation evaluation

Machine translation (MT) is directly linked to its evaluation in order to both compare different MT system outputs and analyse system errors so that they can be addressed and corrected. As a consequence, MT evaluation has become increasingly important and popular in the last decade, leading to the development of MT evaluation metrics aiming at automatically assessing MT output. Most of these metrics use reference translations in order to compare system output, and the most well-known and widely spread work at lexical level. In this study we describe and present a linguistically-motivated metric, VERTa, which aims at using and combining a wide variety of linguistic features at lexical, morphological, syntactic and semantic level. Before designing and developing VERTa a qualitative linguistic analysis of data was performed so as to identify the linguistic phenomena that an MT metric must consider (Comelles et al. 2017). In the present study we introduce VERTa’s design and architecture and we report the experiments performed in order to develop the metric and to check the suitability and interaction of the linguistic information used. The experiments carried out go beyond traditional correlation scores and step towards a more qualitative approach based on linguistic analysis. Finally, in order to check the validity of the metric, an evaluation has been conducted comparing the metric’s performance to that of other well-known state-of-the-art MT metrics.

Lang Resources & Evaluation 53, 57–86 (2019)

10.1007/s10579-018-9430-2

journal verta 2019

PDF

Through the eyes of verta

Elisabet Comelles and Jordi Atserias

This paper describes a practical demo of VERTa for Spanish. VERTa is an MT evaluation metric that combines linguistic features at different levels. VERTa has been developed for English and Spanish but can be easily adapted to other languages. VERTa can be used to evaluate adequacy, fluency and ranking of sentences. In this paper, VERTa’s modules are described briefly, as well as its graphical interface which provides information on VERTa’s performance and possible MT errors.

In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 366–372, Lisbon, Portugal. Association for Computational Linguistics.016

journal paper sepln verta 2016 machine translation machine translation evaluation

PDF

VERTa: a Linguistically-motivated Metric at the WMT15 Metrics Task

Elisabet Comelles and Jordi Atserias

This paper describes VERTa’s submission to the 2015 EMNLP Workshop on Statistical Machine Translation. VERTa is a linguistically-motivated metric that combines linguistic features at different levels. In this paper, VERTa is described briefly, as well as the three versions submitted to the workshop: VERTa-70Adeq30Flu, VERTa-EQ and VERTa-W. Finally, the experiments conducted with the WMT14 data are reported and some conclusions are drawn.

in Proceedings of the tenth workshop on statistical machine translation, wmt@EMNLP 2015, 17-18 september 2015, lisbon, portugal, 2015, pp. 366–372

10.18653/v1/W15-3045

paper 2015 verta wmt evaluation methodologies machine translation competition

PDF

VERTa participation in the WMT14 Metrics Task

Elisabet Comelles and Jordi Atserias

We present VERTa, a linguistically-motivated metric that combines linguistic features at different levels. We provide the linguistic motivation on which the metric is based, as well as describe the different modules in VERTa and how they are combined. Finally, we describe the two versions of VERTa, VERTa-EQ and VERTa-W, sent to WMT14 and report results obtained in the experiments conducted with the WMT12 and WMT13 data into English.

in Proceedings of the ninth workshop on statistical machine translation, wmt@ACL 2014, june 26-27, 2014, baltimore, maryland, USA, pp. 368–375

10.3115/v1/w14-3347

paper 2014 verta wmt evaluation methodologies machine translation competition

PDF

VERTa: Facing a Multilingual Experience of a Linguistically-based MT Evaluation

Elisabet Comelles, Jordi Atserias, Victoria Arranz, Irene Castellon and Jordi Sesé

There are several MT metrics used to evaluate translation into Spanish, although most of them use partial or little linguistic information. In this paper we present the multilingual capability of VERTa, an automatic MT metric that combines linguistic information at lexical, morphological, syntactic and semantic level. In the experiments conducted we aim at identifying those linguistic features that prove the most effective to evaluate adequacy in Spanish segments. This linguistic information is tested both as independent modules (to observe what each type of feature provides) and in a combinatory fastion (where different kinds of information interact with each other). This allows us to extract the optimal combination. In addition we compare these linguistic features to those used in previous versions of VERTa aimed at evaluating adequacy for English segments. Finally, experiments show that VERTa can be easily adapted to other languages than English and that its collaborative approach correlates better with human judgements on adequacy than other well-known metrics.

in Proceedings of the ninth international conference on language resources and evaluation, LREC 2014, reykjavik, iceland, may 26-31, 2014, 2014, pp. 2701–2707

paper 2014 verta lrec evaluation methodologies machine translation

PDF

Using wikipedia for cross-language named entity recognition

E. R. Fernandes, U. Brefeld, R. Blanco, and J. Atserias,

Named entity recognition and classification (NERC) is fundamental for natural language processing tasks such as information extraction, question answering, and topic detection. State-of-the-art NERC systems are based on supervised machine learning and hence need to be trained on (manually) annotated corpora. However, annotated corpora hardly exist for non-standard languages and labeling additional data manually is tedious and costly. In this article, we present a novel method to automatically generate (partially) annotated corpora for NERC by exploiting the link structure of Wikipedia. Firstly, Wikipedia entries in the source language are labeled with the NERC tag set. Secondly, Wikipedia language links are exploited to propagate the annotations in the target language. Finally, mentions of the labeled entities in the target language are annotated with the respective tags. The procedure results in a partially annotated corpus that is likely to contain unannotated entities. To learn from such partially annotated data, we devise two simple extensions of hidden Markov models and structural perceptrons. Empirically, we observe that using the automatically generated data leads to more accurate prediction models than off-the-shelf NERC methods. We demonstrate that the novel extensions of HMMs and perceptrons effectively exploit the partially annotated data and outperforms their baseline counterparts in all settings.

5th international workshop on mining ubiquitous and social environments, muse 2014. In Big data analytics in the social and ubiquitous context - 5th international worksh 5th international workshop on mining ubiquitous and social environments, MUSE 2014, and first international workshop on machine learning for urban sensor data, senseml 2014, revised selected papers, 2014, vol. 9546, pp. 1–25

10.1007/978-3-319-29009-6_1

paper 2014 wikipedia entity recognition hide markov model hmm conditional random field crf named entity recognition and classification nerc

PDF

FBM: Combining lexicon-based ML and heuristics for Social Media Polarities

Carlos Rodríguez-Penagos, Jordi Atserias Batalla, Joan Codina-Filbà, David García-Narbona, Jens Grivolla, Patrik Lambert and Roser Saurí

This paper describes the system implemented by Fundació Barcelona Media (FBM) for classifying the polarity of opinion expressions in tweets and SMSs, and which is supported by a UIMA pipeline for rich linguistic and sentiment annotations. FBM participated in the SEMEVAL 2013 Task 2 on polarity classification. It ranked 5th in Task A (constrained track) using an ensemble system combining ML algorithms with dictionary-based heuristics, and 7th (Task B, constrained) using an SVM classifier with features derived from the linguistic annotations and some heuristics.

in Proceedings of the 7th international workshop on semantic evaluation, semeval@NAACL-hlt 2013, atlanta, georgia, usa, june 14-15, 2013, 2013, pp. 483–489

paper 2013 semeval sentiment analysis polarity of opinion tweets SMS

PDF

Spell Checking in Spanish: The Case of Diacritic Accents

Jordi Atserias, Maria Fuentes, Rogelio Nazar and Irene Renau

This article presents the problem of diacritic restoration (or diacritization) in the context of spell-checking, with the focus on an orthographically rich language such as Spanish. We argue that despite the large volume of work published on the topic of diacritization, currently available spell-checking tools have still not found a proper solution to the problem in those cases where both forms of a word are listed in the checker’s dictionary. This is the case, for instance, when a word form exists with and without diacritics, such as continuo ‘continuous’ and continuó ‘he/she/it continued’, or when different diacritics make other word distinctions, as in continúo ‘I continue’. We propose a very simple solution based on a word bigram model derived from correctly typed Spanish texts and evaluate the ability of this model to restore diacritics in artificial as well as real errors. The case of diacritics is only meant to be an example of the possible applications for this idea, yet we believe that the same method could be applied to other kinds of orthographic or even grammatical errors. Moreover, given that no explicit linguistic knowledge is required, the proposed model can be used with other languages provided that a large normative corpus is available.

in Proceedings of the eighth international conference on language resources and evaluation, LREC 2012, istanbul, turkey, may 23-25, 2012, 2012, pp. 737–742

paper 2012 lrec spell checking tweets

PDF

FBM-Yahoo! at RepLab 2012

Jose M. Chenlo, Jordi Atserias, Carlos Rodriguez and Roi Blanco

This paper describes FBM-Yahoo!’s participation in the profiling task of RepLab 2012, which aims at determining whether a given tweet is related to a specific company and, in if this being the case, whether it contains a positive or negative statement related to the company’s reputation or not. We addressed both problems (ambiguity and polarity reputation) using Support Vector Machines (SVM) classifiers and lexicon-based techniques, building automatically company profiles and bootstrapping background data. Concretely, for the ambiguity task we employed a linear SVM classifier with a token-based representation of relevant and irrelevant information extracted from the tweets and Freebase resources. With respect to polarity classification, we combined SVM lexicon-based approaches with bootstrapping in order to determine the final polarity label of a tweet

in CLEF 2012 evaluation labs and workshop, online working notes, rome, italy, september 17-20, 2012, 2012, vol. 1178

paper 2012 replab clef tweets competition

PDF

VERTa: Linguistic features in MT evaluation

E. Comelles, J. Atserias, V. Arranz, and I. Castellón

In the last decades, a wide range of automatic metrics that use linguistic knowledge has been developed. Some of them are based on lexical information, such as METEOR; others rely on the use of syntax, either using constituent or dependency analysis; and others use semantic information, such as Named Entities and semantic roles. All these metrics work at a specific linguistic level, but some researchers have tried to combine linguistic information, either by combining several metrics following a machine-learning approach or focusing on the combination of a wide variety of metrics in a simple and straightforward way. However, little research has been conducted on how to combine linguistic features from a linguistic point of view. In this paper we present VERTa, a metric which aims at using and combining a wide variety of linguistic features at lexical, morphological, syntactic and semantic level. We provide a description of the metric and report some preliminary experiments which will help us to discuss the use and combination of certain linguistic features in order to improve the metric performance

in Proceedings of the eighth international conference on language resources and evaluation, LREC 2012, istanbul, turkey, may 23-25, 2012

paper 2012 machine translation evaluation methodologies

PDF

Active Learning for Building a Corpus of Questions for Parsing

J. Atserias, G. Attardi, M. Simi, and H. Zaragoza

This paper describes how we built a dependency Treebank for questions. The questions for the Treebank were drawn from questions from the TREC 10 QA task and from Yahoo! Answers. Among the uses for the corpus is to train a dependency parser achieving good accuracy on parsing questions without hurting its overall accuracy. We also explore active learning techniques to determine the suitable size for a corpus of questions in order to achieve adequate accuracy while minimizing the annotation efforts.

in Proceedings of the international conference on language resources and evaluation, LREC 2010, 17-23 may 2010, valletta, malta, 2010, vol. 800, pp. 9–080

paper 2010 corpus parsing question answering

PDF Slides

Automatic Annotation of the Catalan Wikipedia: Exploring the Semantic Space via multiple NERC systems

Jordi Atserias, Judith Domingo, Carlos Rodriguez, Teresa Suñol

This paper presents WikiNer, a snapshot of the Catalan Wikipedia processed with different NLP tools (POS tagger, NERC, dependency parsers). The article focuses on the analysis of different NERC annotations using 3 taggers: JNET, YamCha and SST. Although Wikipedia text (specially in tables, lists, references) differs significantly in distributional properties from the corpora used to train the taggers, we believe that results of automatically annotating the semantic space of the Catalan Wikipedia point to the quick availability of a resource containing massive text annotated with a degree of reliability that is enough for some research tasks as well as for applications, such as simple Q&A, ontology enrichment and semantic search

Proces. del Leng. Natural, vol. 45, pp. 169–173, 2010

journal paper 2010 sepln wikipedia nerc catalan

PDF

Annotated Search and Element Retrieval

Hugo Zaragoza, Michael Matthews, Roi Blanco, and Jordi Atserias

Despite the great interest in different forms of textual annotation (named entity extraction, semantic tagging, syntactic and semantic parsing, etc.), there is still no consensus about which search tasks can be improved with such annotations, and what search algorithms are required to implement e cient engines to solve these tasks. We de ne formally two retrieval tasks in annotated collections: annotated retrieval and element retrieval. We discuss their differences and describe effcient indexing structures, and how they can be implemented in Lucene and MG4J, two open source retrieval engines. Finally, we give a technical overview of two element retrieval use cases.

in Proceedings of first international workshop on living web, collocated with the 8th international semantic web conference (iswc-2009), washington, dc, usa, october 26, 2009, vol. 515

paper 2009

PDF

Semantically Annotated Snapshot of the English Wikipedia

J. Atserias, H. Zaragoza, M. Ciaramita, and G. Attardi

This paper describes SW1, the first version of a semantically annotated snapshot of the English Wikipedia. In recent years Wikipedia has become a valuable resource for both the Natural Language Processing (NLP) community and the Information Retrieval (IR) community. Although NLP technology for processing Wikipedia already exists, not all researchers and developers have the computational resources to process such a volume of information. Moreover, the use of different versions of Wikipedia processed differently might make it difficult to compare results. The aim of this work is to provide easy access to syntactic and semantic annotations for researchers of both NLP and IR communities by building a reference corpus to homogenize experiments and make results comparable. These resources, a semantically annotated corpus and a ’entity containment’ graph.

paper 2008 lrec corpus information extraction information retrieval acquisition machine learning

PDF Slides

Complete and Consistent Annotation of WordNet using the Top Concept Ontology

Javier Álvez, Jordi Atserias, Jordi Carrera, Salvador Climent, Egoitz Laparra, Antoni Oliver and German Rigau

This paper presents the complete and consistent ontological annotation of the nominal part of WordNet. The annotation has been carried out using the semantic features defined in the EuroWordNet Top Concept Ontology and made available to the NLP community. Up to now only an initial core set of 1,024 synsets, the so-called Base Concepts, was ontologized in such a way. The work has been achieved by following a methodology based on an iterative and incremental expansion of the initial labeling through the hierarchy while setting inheritance blockage points. Since this labeling has been set on the EuroWordNet’s Interlingual Index (ILI), it can be also used to populate any other wordnet linked to it through a simple porting process. This feature-annotated WordNet is intended to be useful for a large number of semantic NLP tasks and for testing for the first time componential analysis on real environments. Moreover, the quantitative analysis of the work shows that more than 40% of the nominal part of WordNet is involved in structure errors or inadequacies.

in Proceedings of the international conference on language resources and evaluation, LREC 2008, 26 may - 1 june 2008, marrakech, morocco, 2008

paper 2008 lrec ontologies semantics lexicon lexical database wordnet

PDF Slides

Learning to tag and tagging to learn: A case study on wikipedia

Peter Mika, Massimiliano Ciaramita, Hugo Zaragoza and Jordi Atserias

The problem of semantically annotating Wikipedia inspires a novel method for dealing with domain and task adaptation of semantic taggers in cases where parallel text and metadata are available.

IEEE Intelligent Systems, vol. 23, no. 5, pp. 26–33, 2008.

10.1109/MIS.2008.85

journal 2008

PDF

PoS Tagging with a Named Entity Tagger

M. Ciaramita and J. Atserias

in Proceedings of the final evalita 2007 workshop, 2007.

paper 2007 competition

PDF

Named Entity Tagging with a PoS Tagger

M. Ciaramita and J. Atserias

in Proceedings of the final evalita 2007 workshop, 2007.

paper 2007 competition

PDF

PoS Tagging with a Named Entity Tagger

M. Ciaramita and J. Atserias

Intelligenza Artificiale, special issue (EVALITA 2007), vol. 2, 2007.

journal 2007 competition

Named Entity Tagging with a PoS Tagger

M. Ciaramita and J. Atserias

Intelligenza Artificiale, special issue (EVALITA 2007), vol. 2, 2007.

journal 2007 competition

World knowledge in broad-coverage information filtering

B. A. Hagedorn, M. Ciaramita, and J. Atserias

in SIGIR 2007: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, amsterdam, the netherlands, july 23-27, 2007, 2007, pp. 801–802

10.1145/1277741.1277916

paper 2007

Ranking very many typed entities on wikipedia

Hugo Zaragoza, Henning Rode, Peter Mika, Jordi Atserias, Massimiliano Ciaramita, Giuseppe Attardi

We discuss the problem of ranking very many entities of different types. In particular we deal with a heterogeneous set of types, some being very generic and some very specific. We discuss two approaches for this problem: i) exploiting the entity containment graph and ii) using a Web search engine to compute entity relevance. We evaluate these approaches on the real task of ranking Wikipedia entities typed with a state-of-the-art named-entity tagger. Results show that both approaches can greatly increase the performance of methods based only on passage retrieval.

10.1145/1321440.1321599

paper 2007

PDF

Towards Robustness in Natural Language Understanding

Jordi Atserias

Most of the different tasks included in Natural Language Processing (nlp) (such as, Word Sense Disambiguation, Information Retrieval, Information Extraction, Question Answering, Information Filtering, Natural Language Interfaces, Story Understanding or Machine Translation) apply different levels of Natural Language Understanding (nlu). This thesis explores a new integrated architecture for robust nlu, exploiting constraint-based optimization techniques. The goal of this work is towards robust and flexible architectures able to deal with the complexity of advanced nlp. In particular, we present a novel architecture (pardon), orthogonal to the traditional nlp task decomposition, which applies any kind of knowledge (syntactic, semantic, linguistic, statistical) at the earliest opportunity while retaining an independent representation of the different kinds of knowledge. The different architectures proposed for nlu can be classified based on two main dimensions, namely, the level of integration of their processes and the level of integration of their data. An easier modularization aimed at focusing on a particular nlp task and competitions (e.g. MUC, TREC, etc) have lead most of the researchers to adopt a pipelined or stratified architecture. However, this architecture shows several drawbacks which has made us consider the use of integrated and interactive approaches. In order to implement such approaches, we will also introduce the Consistent Labeling Problems (clps), a specific case of Constraint Satisfaction Problems that can be solved eficiently by a set of iterative algorithms (e.g. relaxation labeling). Constraints allow us to integrate both processes and knowledge in the same framework. On the one hand, many forms of ambiguity can be represented in a compact and elegant manner, and processed eficiently by means of constraints. On the other hand, many nlp processes (e.g., many wsd techniques) could also be represented as constraints. Inside the pardon architecture, an object uses its models to combine itself with other objects. During this combination, some of its attribute values are determined (in a similar way to Hearst’s Polaroid Words [Hirst, 1987]). Roughly speaking, pardon combines objects from one level in order to build the objects corresponding to the next level of the task under consideration. This combination is carried out by using lexicalized models. That is, these models must be anchored-in/trigged by a first-level object. pardon represents the relationships between objects in a dependency-like style, with models and roles. In order to avoid the combinatorial explosion of possible object combinations, this framework is formalized as a Consistent Labeling Problem (clp). Thus, it can be solved using optimization methods (e.g. the relaxation labeling algorithm) to find the most consistent solution. pardon aims to give a general framework, that is multilingual and open domain, in which different nlp tasks can be easily formalized. These different tasks can be tested separately or carried out simultaneously following an integrated approach. Pursuing this goal, we have also integrated several resources in a multilingual knowledge base, named Multilingual Central Repository (mcr). mcr has been built aroundWordNet, using the EuroWordNet architecture. This multilingual repository integrates different resources, ontologies (sumo, Top Concept Ontology), thematic classifications (Domains), local wordnets of have diferent languages, and so on. The new architecture proposed by pardon has been successfully applied to two different nlu tasks involved in Semantic Interpretation, namely Semantic Role Labeling (srl) and Word Sense Disambiguation (wsd). Usually, Word Sense Disambiguation and Semantic Role Labeling are considered separately although they are strongly related. wsd can improve results in srl (as different senses have different syntactic behaviours, specially verbs) and vice-versa (e.g. using verbal preferences for wsd)."

tesis 2006 pardon semantic role labeling consistent labeling problem relaxation labeling word sense disambiguation srl wsd

PDF

FreeLing 1.3: Syntactic and semantic services in an open-source NLP library

J. Atserias, B. Casas, E. Comelles, M. González, L. Padró, and M. Padró

This paper describes version 1.3 of the FreeLing suite of NLP tools. FreeLing was first released in February 2004 providing morphological analysis and PoS tagging for Catalan, Spanish, and English. From then on, the package has been improved and enlarged to cover more languages (i.e. Italian and Galician) and offer more services: Named entity recognition and classification, chunking, dependency parsing, and WordNet based semantic annotation. FreeLing is not conceived as end-user oriented tool, but as library on top of which powerful NLP applications can be developed. Nevertheless, sample interface programs are provided, which can be straightforwardly used as fast, flexible, and efficient corpus processing tools. A remarkable feature of FreeLing is that it is distributed under a free-software LGPL license, thus enabling any developer to adapt the package to his needs in order to get the most suitable behaviour for the application being developed.

in Proceedings of the fifth international conference on language resources and evaluation, LREC 2006, genoa, italy, may 22-28, 2006, 2006, pp. 2281–2286

paper 2006 lrec freeling nlp tools corpus analysis tools morphological analysis pos tagging nerc named entity recognition and classification semantic annotation free software.

PDF

Multiwords and Word Sense Disambiguation

Victoria Arranz, Jordi Atserias and Mauro Castillo

This paper studies the impact of multiword expressions on Word Sense Disambiguation (WSD). Several identification strategies of the multiwords in WordNet2.0 are tested in a real Senseval-3 task: the disambiguation of WordNet glosses. Although we have focused on Word Sense Disambiguation, the same techniques could be applied in more complex tasks, such as Information Retrieval or Question Answering

in Computational linguistics and intelligent text processing, 6th international conference, cicling 2005, mexico city, mexico, february 13-19, 2005, proceedings, 2005, vol. 3406, pp. 250–262

10.1007/978-3-540-30586-6_28

paper 2005 wsd multiwords word sense disambiguation wordnet

PDF

Un Enfoque Integrado para la Desambiguación

Jordi Atserias

This paper presents an extension for WSD of an integrated arquitecture disigned for Semantic Parsing. In the proposed framework, both tasks could be adressed simultaneously, colaborating between them. The feasibility and robustness of the proposed arquitecture have been proved against a well-defined task on WSD (the SENSEVAL-II English Lexical Sample) using automatically acquired models.

in XXI congreso de la sociedad española para el procesamiento del lenguaje natural (sepln’05), 2005, pp. 179–186.

paper 2005 sepln wsd semantic parsing

PDF

Artificial intelligence and computer science

Jordi Atserias

S. Shannon, Ed. Nova Science Publisher Inc., 2005, pp. 177–196.

book 2005

TXALA un analizador libre de dependencias para el castellano

Jordi Atserias Batalla, Elisabet Comelles Pujadas and Aingeru Mayor

In this demo we present the first version of Txala, a dependency parser for Spanish developed under LGPL license. This parser is framed in the development of a free-software platform for Machine Translation. Due to the lack of this kind of syntactic parsers for Spanish, this tool is essential for the development of NLP in Spanish.

in XXI congreso de la sociedad española para el procesamiento del lenguaje natural (sepln'05), 2005, pp. 455–456.

journal paper 2005 sepln syntax parsing dependecy parser nlp tools

PDF

An Integrated Approach to Word Sense Disambiguation

J. Atserias, L. Padró, and G. Rigau

in Recent advances in natural language processing (ranlp’05), 2005, pp. 82–88.

paper 2005

A Proposal for a Shallow Ontologization of Wordnet

Salvador Climent, Jordi Atserias Batalla, Joaquim Moré López and German Rigau Claramunt

This paper presents the work carried out towards the so-called shallow ontologization of WordNet, which is argued to be a way to overcome most of the many structural problems of the widely used lexical knowledge base. The result shall be a multilingual resource more suitable for large-scale semantic processing.

Proces. del Leng. Natural, vol. 35, pp. 161–167, 2005

journal paper sepln 2005 wordnet ontologies

PDF

The MEANING Multilingual Central Repository

Jordi Atserias, Luís Villarejo, German Rigau, Eneko Agirre, John Carroll, Bernardo Magnini and Piek Vossen

This paper describes the first version of the Multilingual Central Repository, a lexical knowledge base developed in the framework of the MEANING project. Currently the MCR integrates into the EuroWordNet framework five local wordnets (including four versions of the English WordNet from Princeton), an upgraded version of the EuroWordNet Top Concept ontology, the MultiWordNet Domains, the Suggested Upper Merged Ontology (SUMO) and hundreds of thousand of new semantic relations and properties automatically acquired from corpora. We believe that the resulting MCR will be the largest and richest Multilingual Lexical Knowledge Base in existence.

in 2nd International global wordnet conference (gwc'04), 2004, pp. 23-30.

paper 2004

Cross-Language Acquisition of Semantic Models for Verbal Predicates

Jordi Atserias, Bernardo Magnini, Octavian Popescu, Eneko Agirre, Aitziber Atutxa, German Rigau, John Carroll, Rob Koeling

This paper presents a semantic-driven methodology for the automatic acquisition of verbal models. Our approach relies strongly on the semantic generalizations allowed by already existing resources (e.g. Domain labels, Named Entity categories, concepts in the SUMO ontology, etc). Several experiments have been carried out using comparable corpora in four languages (Italian, Spanish, Basque and English) and two domains (FINANCE and SPORT) showing that the semantic patterns acquired can be general enough to be ported from one language to the other language.

in 4th International conference on language resources and evaluation (lrec'04), 2004, pp. 33-36.

paper 2004 lrec multilingual knowledge acquisition

PDF

Towards the MEANING Top Ontology: Sources of Ontological Meaning

Jordi Atserias, Salvador Climent, German Rigau

This paper describes the initial research steps towards the Top Ontology for the Multilingual Central Repository (Mcr) built in the Meaning project. The current version of the Mcr integrates five local wordnets plus four versions of Princeton’s English WordNet, three ontologies and hundreds of thousands of new semantic relations and properties automatically acquired from corpora. In order to maintain compatibility among all these heterogeneous knowledge resources, it is fundamental to have a robust and advanced ontological support. This paper studies the mapping of main Sources of Ontological Meaning onto the wordnets and, in particular, the current work in mapping the EuroWordNet Top Concept Ontology.

in 4th International conference on language resources and evaluation (lrec’04), 2004, pp. 11–14.

paper 2004 lrec wordnet eurowordnet ontologies

PDF

Spanish WordNet 1.6: Porting the Spanish Wordnet across Princeton versions

Jordi Atserias, Luís Villarejo, German Rigau

This paper describes the new Spanish Wordnet aligned to Princeton WordNet1.6 and the analysis of the transformation from the previous version aligned to Princeton WordNet1.5. Although a mapping technology exists, to our knowledge it is the first time a whole local wordnet has been ported to a newer release of the Princeton WordNet.

in 4th International conference on language resources and evaluation (lrec’04), 2004, pp. 161–164.

paper 2004 lrec wordnet eurowordnet

PDF

The TALP Systems for Disambiguating WordNet Glosses

Mauro Castillo, Francis Real, Jordi Atserias and German Rigau

This paper presents a summary report on the empirical results obtained on the SENSEVAL-3 task 12 “Word-Sense Disambiguation of WordNet Glosses”. Our method combines a set of knowledge-based heuristics integrating several information resources and techniques. From the ten systems presented at the taks, our systems obtained the first and third positions.

paper 2004 competition senseval wordnet wsd

PDF

Automatic Acquisition of Sense Examples Using ExRetriever

M. Cuadros, J. Atserias, M. Castillo, and G. Rigau

A current research line for word sense disambiguation (WSD) focuses on the use of supervised machine learning techniques. One of the drawbacks of using such techniques is that previously sense annotated data is required. This paper presents ExRetriever, a new software tool for automatically acquiring large sets of sense tagged examples from large collections of text and the Web. ExRetriever exploits the knowledge contained in large-scale knowledge bases (e.g., WordNet) to build complex queries, each of them characterising particular senses of a word. These examples can be used as training instances for supervised WSD algorithms.

in IBERAMIA workshop on lexical resources and the web for word sense disambiguation, 2004, pp. 97–104.

paper 2004 wsd

PDF

Automatic Acquisition of Sense Examples Using ExRetriever

J. Fernández, M. Castillo, G. Rigau, J. Atserias, and J. Turmo

A current research line for word sense disambiguation (WSD) focuses on the use of supervised machine learning techniques. One of the drawbacks of using such techniques is that previously sense annotated data is required. This paper presents ExRetriever, a new software tool for automatically acquiring large sets of sense tagged examples from large collections of text and the Web. ExRetriever exploits the knowledge contained in large-scale knowledge bases (e.g., WordNet) to build complex queries, each of them characterising particular senses of a word. These examples can be used as training instances for supervised WSD algorithms.

in Proceedings of the fourth international conference on language resources and evaluation, LREC 2004, may 26-28, lisbon, portugal, 2004

paper 2004 acquisition WordNet WSD

PDF

Exploring large-scale Acquisition of Multilingual Semantic Models for Predicates

Jordi Atserias, Mauro Castillo, Francis Real, Horacio Rodríguez and Germán Rigau

paper journal sepln 2003

PDF

Starting up the Multilingual Central Repository

Jordi Atserias, German Rigau and Luís Villarejo

paper journal sepln 2003

PDF

Integrating and porting Knowleges across Languages

Jordi Atserias, German Rigau and Luís Villarejo

in Recent advances in natural language processing (ranlp'03), 2003, pp. 31-37. ISBN 954-90906-6-3

paper 2003

PDF

Starting up the Multilingual Central Repository

Jordi Atserias, German Rigau and Luís Villarejo

journal paper sepln 2003

PDF

The MEANING project

Germán Rigau, Eneko Agirre and Jordi Atserias

journal paper sepln 2003

PDF

First Release of the Multilingual Central Repository of MEANING

Luís Villarejo, Jordi Atserias, Gerard Escudero, and German Rigau

demo sepln 2003

PDF

Integrating Multiple Knowledge Sources for Robust Semantic Parsing

Jordi Atserias, Lluís Padró and Germán Rigau

This work explores a new robust approach for Semantic Parsing of unrestricted texts. Our approach considers Semantic Parsing as a Consistent Labelling Problem (clp), allowing the integration of several knowledge types (syntactic and semantic) obtained from different sources (linguistic and statistic). The current implementation obtains 95% accuracy in model identification and 72% in case-role filling

paper 2001

PDF

Combining Supervised and Unsupervised Lexical Knowledge Methods for Word Sense Disambiguation

E. Agirre, G. Rigau, L. Padró, and J. Atserias

Comput. Humanit., vol. 34, nos. 1-2, pp. 103-108, 2000

10.1023/A:1002486301447

paper 2000

PDF

Semantic Analysis based on Verbal Subcategorization

J. Atserias, I. Castellón, M. Civit, and G. Rigau

in Conference on intelligent text processing and computational linguistics (cicling'00), 2000, pp. 330-340.

paper 2000

PDF

Using Diathesis for Semantic Parsing

J. Atserias, I. Castellón, M. Civit, and G. Rigau

in Venecia per il trattamento automatico delle lingue (vextal'99), 1999, pp. 385-392.

paper 1999

PDF

Morphosyntactic Analysis and Parsing of Unrestricted Spanish Text

Jordi Atserias i Batalla, Josep Carmona Vargas, Irene Castellón Masalles, Sergi Cervell, Montserrat Civit Torruella, Lluís Màrquez, María Antonia Martí Antonín, Lluís Padró Cirera, Roberto Placer, Horacio Rodríguez Hontoria, Mariona Taulé Delor, Jordi Turmo

This online demonstration is about an environment for massive processing of unrestricted Spanish text. The system consists of three stages: morphological analysis, POS disambiguation and parsing. The output of each can be pipelined into the next. The first two phases are described in (Carmona et al., 1998) and the third is described in (Atserias et al., 1998), both published in this conference. The execution may be performed inside the GATE environment, which enables visualization and analysis of intermediate results, or either in background, if higher eeciency is required for massive text processing.

paper 1998 demo gate lrec

PDF

Syntactic Parsing of Unrestricted Spanish Text

I. Castellón, M. Civit, and J. Atserias

This research focusses on the syntactical parsing of morphologycal tagged corpora. A proposal for a corpus oriented Spanish grammar is presented in this document. This work has been developed in the framework of the ITEM project and its main goal is to provide multilingual background for information extraction and retrieval tasks. The main goal of Tacat analyser is to provide a way of obtaining large amounts of bracketed and parsed corpora, both general land specific domain. Tacat uses context free grammars and has as input following categories of Parole specification.The incremental methodology that we use allows us to recognise different levels of complexity in the analysis and to produce compatible outputs of all the grammars

in 1st International conference on language resources and evaluation (lrec'98), 1998, pp. 603-609.

paper 1998

PDF

Combining Multiple Methods for the Automatic Construction of Multilingual WordNets

Jordi Atserias, Salvador Climent, Xavier Farreres, German Rigau and Horacio Rodríguez

This paper explores the automatic construction of a multilingual Lexical Knowledge Base from preexisting lexical resources. First, a set of automatic and complementary techniques for linking Spanish words collected from monolingual and bilingual MRDs to English WordNet synsets are described. Second, we show how resulting data provided by each method is then combined to produce a preliminary version of a Spanish WordNet with an accuracy over 85%. The application of these combinations results on an increment of the extracted connexions of a 40% without losing accuracy. Both coarse-grained (class level) and fine-grained (synset assignment level) confidence ratios are used and evaluated. Finally, the results for the whole process are presented.

in Recent advances in natural language (ranlp'97), 1997, pp. 143-149

10.1075/cilt.189.32ats

paper 1997

PDF

Combining Unsupervised Lexical Knowledge Methods for Word Sense Disambiguation

German Rigau, Jordi Atserias, and Eneko Agirre

This paper presents a method to combine a set of unsupervised algorithms that can accurately disambiguate word senses in a large, completely untagged corpus. Although most of the techniques for word sense resolution have been presented as stand-alone, it is our belief that full-fledged lexical ambiguity resolution should combine several information sources and techniques. The set of techniques have been applied in a combined way to disambiguate the genus terms of two machine-readable dictionaries (MRD), enabling us to construct complete taxonomies for Spanish and French. Tested accuracy is above 80% overall and 95% for two-way ambiguous genus terms, showing that taxonomy building is not limited to structured dictionaries such as LDOCE.

in Joint 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics (ACL/EACL’97), 1997, pp. 48–55

10.3115/976909.979624

paper 1997 acl wsd

PDF

		Euskal Herriko Unibertsitatea, San Sebastian, Spain. 2006 PhD in Natural Language Processing (European) Publications Towards Robustness in Natural Language Understanding
		Universitat Politèctnica de Catalunya 1994 B.Sc. in Computer Science

Hi, I am Jordi