
Hi, I am Jordi

Jordi Atserias Batalla

Senior Software Engineer at Amazon

I am a passionate software engineer interested in Machine Learning, Natural Language Processing, Information Retrieval, Big Data and scalability, always moving around the grey area between Software Engineer and Applied Scientist. I have been lucky to work for amazing companies (Larousse-VOX, Yahoo! Research, Trovit, Amazon) and universities (UPC, UOC, UPF, EHU, IE Business School) without moving from my home town. I have been a federated chess player since 1992, but for the last 20 years I have played only one tournament per year. With COVID, in 2021 I played my first online tournament: the FIDE Online World Corporate Chess Championship, as part of the Amazon EU team. I am part of the old-fart-geeks NGO FriquiFund BCN (friqui is slang for geek in Catalan and Spanish), an organisation that helps young geeks, budding jedis, and white-hat apprentices in need. I also created and maintain the website of the Holiness Ark Nursery and Primary Community School, a private non-profit charitable organization that helps disadvantaged groups, particularly orphans, vulnerable children and elderly people, and is driven by the local community of Kihogo, Kasenda parish, Uganda.


Senior Software Engineer

May 2017 - Present, Barcelona

Amazon strives to be Earth’s most customer-centric company, AWS, Kindle, Fire tablets, Fire TV, Amazon Echo, Alexa …

  • (2022) books (CX, widgets and related services, and improving search for books)
  • (2019) search relevance (fast experimentation framework, DL MLops platform).
  • (2017) core-ai / core-ml (brand annotators, behavioral features for ranking). As the most senior engineer in the group and bcn site, leading projects and engineering best practices.

Search and Big data engineer

May 2016 - May 2017, Barcelona

Search aggregator to find jobs, homes, cars, products (more than 44 countries)

  • Leading a small team to improve search relevance.
  • Revamp the search engine (based on Solr/Lucene) and integrate it into k8s


2015 - 2016, Barcelona

CIE-9/CIE-10 ML assistant to annotate medical documents

  • Drive the design and implementation of the assistant

Research Engineer
Yahoo! Labs / Fundació Barcelona Media

2006 - 2015, Barcelona

Web search engine, email, verticals

  • Research and implementation of search prototypes, knowledge transfer to production teams
  • Participating in several European Funded projects
  • Coordinating several Spanish Government funded projects
  • Supervising several international master's students (DMKM) and PhD students doing internships at Yahoo!

Research Scientist
Universitat Oberta de Catalunya (UOC)

Jan 2006 - May 2006, Barcelona


  • Spanish RESTAD project on translation tools.

Universitat Politecnica de Catalunya (UPC)

1995 - 2005, Barcelona


PhD. student/Research Scientist

2002 - 2005

  • Coordination, Research and implementation related to MEANING project.
PhD. student/Research Scientist

1997 - 2000

  • Coordination, Research and implementation related to EuroWordnet project.
PhD. student/Research Scientist

1996 - 1997

  • Research and implementation related to ITEM project.
Part-time lecturer

1995 - 1997

  • teaching

Vivendi Universal (Spes/Vox/Larousse)

2000 - 2002, Barcelona

Publishing company

  • Consulting and Development of lexicographic tools.


B.Sc. in Computer Science


Contributor March 2018 - Present

VERTa addresses the evaluation of MT from a linguistically motivated point of view. It is part of research that aims to emphasize the effectiveness of linguistic analysis in order to identify and test those linguistic features that help in evaluating the traditional concepts of adequacy and fluency. VERTa combines different modules (lexical, morphological, syntactic, n-gram and semantic) and can be easily adapted to different evaluation types (fluency, adequacy, MT quality) and to different languages or genres.


European Union Project (FP3-ESPRIT 3) ESPRIT-731 Acquisition of lexical knowledge for natural language processing systems, semiautomatically from machine readable versions of conventional dictionaries (MRDs) for English, Spanish, Italian and Dutch.


Building a multilingual wordnet with semantic relations between words European Union Project (LRE TELEMATICS) LE-24003 Produced a rich and high quality coding of semantic relations and equivalence relations for a common set of about 5,000 base concepts in the four languages. 06/1996–03/1997 Researcher


Textual Information Retrieval in a multilingual environment using NL Techniques. Spanish National Project CICYT TIC96-1234-C03-02


Developing Multilingual Web-scale Language Technologies European Union Project (IST) IST- 2001-34460 03/2002–03/2005 Researcher-Group Coordination


European Union Project (ICT/2007.8.6: FET proactive 6) ICT-2009-231126 02/2009–02/2012 Research engineer-FBM Coordinator


La Enciclopedia Automática de personas y Organizaciones: Spanish National project TIN2010-21128-C02-02 Subprograma de Proyectos de investigación Fundamental No Orientada 01/2011-01/2014 Main Researcher-FBM Coordinator

Social Media

Métodos y Tecnologías para los Medios Sociales Spanish National Project CENIT CEN-20101037 2010- 2014 Research engineer-Researcher


ARchive COmmunities MEMories. From Collect-All Archives to Community Memories – Leveraging the Wisdom of the Crowds for Intelligent Preservation. Leverage the Wisdom of the Crowds for content appraisal, selection and preservation, so that archives reflect collective memory and social content perception. European Union Project FP7, FP7-IST-270239. 2012. Research engineer-Researcher


Building mobile applications that approach human performance in conversational interaction. European Union Project FP7 FP7-ICT-287615 . 2012. Research engineer-Researcher


During my early years at UPC I developed an automaton for the generation/recognition of morphosyntactic word forms in CLOS and a syntactic CFG parser in C++ that ended up being part of FreeLing.

Super Sense Tagger (Java)

SST is a C implementation of a SuperSense Tagger (HMM with averaged perceptron); JSST is a Java re-implementation.


Sentence aligner

Solr Colored Index

An exercise in implementing colored indexes in Solr. Colored indexes allow smart searches that combine text and annotations (e.g. coming from shallow NLP taggers).
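The idea can be illustrated outside Solr with a toy in-memory positional index. This is only a sketch under assumptions: the class `ColoredIndex`, its methods and the `ner` layer name are hypothetical and do not reflect the actual Solr implementation. Each "color" is a separate layer of postings that shares token positions with the text, so a query can require a word and an annotation label at the same position.

```python
from collections import defaultdict


class ColoredIndex:
    """Toy positional index (hypothetical): layer -> term -> doc_id -> positions."""

    def __init__(self):
        self.layers = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))

    def add(self, doc_id, tokens, annotations):
        # tokens: list of words; annotations: layer name -> labels aligned with tokens
        for pos, tok in enumerate(tokens):
            self.layers["text"][tok.lower()][doc_id].add(pos)
        for layer, labels in annotations.items():
            for pos, label in enumerate(labels):
                if label is not None:
                    self.layers[layer][label][doc_id].add(pos)

    def search(self, word, layer, label):
        """Return docs where `word` occurs at a position tagged `label` in `layer`."""
        word_postings = self.layers["text"].get(word.lower(), {})
        label_postings = self.layers[layer].get(label, {})
        return sorted(
            doc
            for doc, positions in word_postings.items()
            if positions & label_postings.get(doc, set())
        )


index = ColoredIndex()
index.add(1, ["Paris", "Hilton", "sings"], {"ner": ["PER", "PER", None]})
index.add(2, ["Paris", "is", "lovely"], {"ner": ["LOC", None, None]})

# Only the document where "Paris" is tagged as a location matches.
location_hits = index.search("paris", "ner", "LOC")
person_hits = index.search("paris", "ner", "PER")
```

A plain text query for "paris" would return both documents; intersecting positions across layers is what lets the annotation disambiguate them.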


Publication Profiles:
VERTa: a linguistic approach to automatic machine translation evaluation
Elisabet Comelles and Jordi Atserias

Machine translation (MT) is directly linked to its evaluation in order to both compare different MT system outputs and analyse system errors so that they can be addressed and corrected. As a consequence, MT evaluation has become increasingly important and popular in the last decade, leading to the development of MT evaluation metrics aiming at automatically assessing MT output. Most of these metrics use reference translations in order to compare system output, and the most well-known and widely spread work at lexical level. In this study we describe and present a linguistically-motivated metric, VERTa, which aims at using and combining a wide variety of linguistic features at lexical, morphological, syntactic and semantic level. Before designing and developing VERTa a qualitative linguistic analysis of data was performed so as to identify the linguistic phenomena that an MT metric must consider (Comelles et al. 2017). In the present study we introduce VERTa’s design and architecture and we report the experiments performed in order to develop the metric and to check the suitability and interaction of the linguistic information used. The experiments carried out go beyond traditional correlation scores and step towards a more qualitative approach based on linguistic analysis. Finally, in order to check the validity of the metric, an evaluation has been conducted comparing the metric’s performance to that of other well-known state-of-the-art MT metrics.

Lang Resources & Evaluation 53, 57–86 (2019)

Through the eyes of verta
Elisabet Comelles and Jordi Atserias

This paper describes a practical demo of VERTa for Spanish. VERTa is an MT evaluation metric that combines linguistic features at different levels. VERTa has been developed for English and Spanish but can be easily adapted to other languages. VERTa can be used to evaluate adequacy, fluency and ranking of sentences. In this paper, VERTa’s modules are described briefly, as well as its graphical interface which provides information on VERTa’s performance and possible MT errors.

In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 366–372, Lisbon, Portugal. Association for Computational Linguistics.

VERTa: a Linguistically-motivated Metric at the WMT15 Metrics Task
Elisabet Comelles and Jordi Atserias

This paper describes VERTa’s submission to the 2015 EMNLP Workshop on Statistical Machine Translation. VERTa is a linguistically-motivated metric that combines linguistic features at different levels. In this paper, VERTa is described briefly, as well as the three versions submitted to the workshop: VERTa-70Adeq30Flu, VERTa-EQ and VERTa-W. Finally, the experiments conducted with the WMT14 data are reported and some conclusions are drawn.

in Proceedings of the tenth workshop on statistical machine translation, wmt@EMNLP 2015, 17-18 september 2015, lisbon, portugal, 2015, pp. 366–372

VERTa participation in the WMT14 Metrics Task
Elisabet Comelles and Jordi Atserias

We present VERTa, a linguistically-motivated metric that combines linguistic features at different levels. We provide the linguistic motivation on which the metric is based, as well as describe the different modules in VERTa and how they are combined. Finally, we describe the two versions of VERTa, VERTa-EQ and VERTa-W, sent to WMT14 and report results obtained in the experiments conducted with the WMT12 and WMT13 data into English.

in Proceedings of the ninth workshop on statistical machine translation, wmt@ACL 2014, june 26-27, 2014, baltimore, maryland, USA, pp. 368–375

VERTa: Facing a Multilingual Experience of a Linguistically-based MT Evaluation
Elisabet Comelles, Jordi Atserias, Victoria Arranz, Irene Castellon and Jordi Sesé

There are several MT metrics used to evaluate translation into Spanish, although most of them use partial or little linguistic information. In this paper we present the multilingual capability of VERTa, an automatic MT metric that combines linguistic information at lexical, morphological, syntactic and semantic level. In the experiments conducted we aim at identifying those linguistic features that prove the most effective to evaluate adequacy in Spanish segments. This linguistic information is tested both as independent modules (to observe what each type of feature provides) and in a combinatory fashion (where different kinds of information interact with each other). This allows us to extract the optimal combination. In addition we compare these linguistic features to those used in previous versions of VERTa aimed at evaluating adequacy for English segments. Finally, experiments show that VERTa can be easily adapted to other languages than English and that its collaborative approach correlates better with human judgements on adequacy than other well-known metrics.

in Proceedings of the ninth international conference on language resources and evaluation, LREC 2014, reykjavik, iceland, may 26-31, 2014, 2014, pp. 2701–2707

Using wikipedia for cross-language named entity recognition
E. R. Fernandes, U. Brefeld, R. Blanco, and J. Atserias,

Named entity recognition and classification (NERC) is fundamental for natural language processing tasks such as information extraction, question answering, and topic detection. State-of-the-art NERC systems are based on supervised machine learning and hence need to be trained on (manually) annotated corpora. However, annotated corpora hardly exist for non-standard languages and labeling additional data manually is tedious and costly. In this article, we present a novel method to automatically generate (partially) annotated corpora for NERC by exploiting the link structure of Wikipedia. Firstly, Wikipedia entries in the source language are labeled with the NERC tag set. Secondly, Wikipedia language links are exploited to propagate the annotations in the target language. Finally, mentions of the labeled entities in the target language are annotated with the respective tags. The procedure results in a partially annotated corpus that is likely to contain unannotated entities. To learn from such partially annotated data, we devise two simple extensions of hidden Markov models and structural perceptrons. Empirically, we observe that using the automatically generated data leads to more accurate prediction models than off-the-shelf NERC methods. We demonstrate that the novel extensions of HMMs and perceptrons effectively exploit the partially annotated data and outperform their baseline counterparts in all settings.

in Big data analytics in the social and ubiquitous context - 5th international workshop on mining ubiquitous and social environments, MUSE 2014, and first international workshop on machine learning for urban sensor data, senseml 2014, revised selected papers, 2014, vol. 9546, pp. 1–25

FBM: Combining lexicon-based ML and heuristics for Social Media Polarities
Carlos Rodríguez-Penagos, Jordi Atserias Batalla, Joan Codina-Filbà, David García-Narbona, Jens Grivolla, Patrik Lambert and Roser Saurí

This paper describes the system implemented by Fundació Barcelona Media (FBM) for classifying the polarity of opinion expressions in tweets and SMSs, and which is supported by a UIMA pipeline for rich linguistic and sentiment annotations. FBM participated in the SEMEVAL 2013 Task 2 on polarity classification. It ranked 5th in Task A (constrained track) using an ensemble system combining ML algorithms with dictionary-based heuristics, and 7th (Task B, constrained) using an SVM classifier with features derived from the linguistic annotations and some heuristics.

in Proceedings of the 7th international workshop on semantic evaluation, semeval@NAACL-hlt 2013, atlanta, georgia, usa, june 14-15, 2013, 2013, pp. 483–489

Spell Checking in Spanish: The Case of Diacritic Accents
Jordi Atserias, Maria Fuentes, Rogelio Nazar and Irene Renau

This article presents the problem of diacritic restoration (or diacritization) in the context of spell-checking, with the focus on an orthographically rich language such as Spanish. We argue that despite the large volume of work published on the topic of diacritization, currently available spell-checking tools have still not found a proper solution to the problem in those cases where both forms of a word are listed in the checker’s dictionary. This is the case, for instance, when a word form exists with and without diacritics, such as continuo ‘continuous’ and continuó ‘he/she/it continued’, or when different diacritics make other word distinctions, as in continúo ‘I continue’. We propose a very simple solution based on a word bigram model derived from correctly typed Spanish texts and evaluate the ability of this model to restore diacritics in artificial as well as real errors. The case of diacritics is only meant to be an example of the possible applications for this idea, yet we believe that the same method could be applied to other kinds of orthographic or even grammatical errors. Moreover, given that no explicit linguistic knowledge is required, the proposed model can be used with other languages provided that a large normative corpus is available.

in Proceedings of the eighth international conference on language resources and evaluation, LREC 2012, istanbul, turkey, may 23-25, 2012, 2012, pp. 737–742
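The word-bigram idea described in the abstract above can be sketched as follows. This is a simplified illustration, not the paper's system: it assumes a tiny hand-made corpus, uses only the previous word as context, decodes greedily, and handles only the acute accent. The names `strip_accents` and `BigramDiacritizer` are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict

ACUTE = "\u0301"  # combining acute accent


def strip_accents(word):
    """Remove acute accents only, so that ñ and ü are left untouched."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace(ACUTE, ""))


class BigramDiacritizer:
    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.variants = defaultdict(set)  # accent-free form -> attested spellings
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split()
            for prev, cur in zip(tokens, tokens[1:]):
                self.unigrams[cur] += 1
                self.bigrams[(prev, cur)] += 1
                self.variants[strip_accents(cur)].add(cur)

    def restore(self, sentence):
        restored, prev = [], "<s>"
        for token in sentence.lower().split():
            candidates = self.variants.get(strip_accents(token), {token})
            # Prefer the variant seen most often after the previous word,
            # then the most frequent variant overall; the string breaks ties.
            best = max(
                candidates,
                key=lambda c: (self.bigrams[(prev, c)], self.unigrams[c], c),
            )
            restored.append(best)
            prev = best
        return " ".join(restored)


corpus = [
    "el ruido continuo molesta",
    "ella continuó el trabajo",
    "yo continúo el trabajo",
]
model = BigramDiacritizer(corpus)
fixed_past = model.restore("ella continuo el trabajo")
fixed_present = model.restore("yo continuo el trabajo")
fixed_adjective = model.restore("el ruido continuo molesta")
```

With this toy corpus the three spellings of "continuo" are told apart purely by the preceding word; a realistic model would need a large normative corpus and smoothing for unseen bigrams.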

FBM-Yahoo! at RepLab 2012
Jose M. Chenlo, Jordi Atserias, Carlos Rodriguez and Roi Blanco

This paper describes FBM-Yahoo!’s participation in the profiling task of RepLab 2012, which aims at determining whether a given tweet is related to a specific company and, if this is the case, whether or not it contains a positive or negative statement related to the company’s reputation. We addressed both problems (ambiguity and polarity reputation) using Support Vector Machine (SVM) classifiers and lexicon-based techniques, automatically building company profiles and bootstrapping background data. Concretely, for the ambiguity task we employed a linear SVM classifier with a token-based representation of relevant and irrelevant information extracted from the tweets and Freebase resources. With respect to polarity classification, we combined SVM lexicon-based approaches with bootstrapping in order to determine the final polarity label of a tweet.

in CLEF 2012 evaluation labs and workshop, online working notes, rome, italy, september 17-20, 2012, 2012, vol. 1178

VERTa: Linguistic features in MT evaluation
E. Comelles, J. Atserias, V. Arranz, and I. Castellón

In the last decades, a wide range of automatic metrics that use linguistic knowledge has been developed. Some of them are based on lexical information, such as METEOR; others rely on the use of syntax, either using constituent or dependency analysis; and others use semantic information, such as Named Entities and semantic roles. All these metrics work at a specific linguistic level, but some researchers have tried to combine linguistic information, either by combining several metrics following a machine-learning approach or focusing on the combination of a wide variety of metrics in a simple and straightforward way. However, little research has been conducted on how to combine linguistic features from a linguistic point of view. In this paper we present VERTa, a metric which aims at using and combining a wide variety of linguistic features at lexical, morphological, syntactic and semantic level. We provide a description of the metric and report some preliminary experiments which will help us to discuss the use and combination of certain linguistic features in order to improve the metric performance

in Proceedings of the eighth international conference on language resources and evaluation, LREC 2012, istanbul, turkey, may 23-25, 2012

Active Learning for Building a Corpus of Questions for Parsing
J. Atserias, G. Attardi, M. Simi, and H. Zaragoza

This paper describes how we built a dependency Treebank for questions. The questions for the Treebank were drawn from questions from the TREC 10 QA task and from Yahoo! Answers. Among the uses for the corpus is to train a dependency parser achieving good accuracy on parsing questions without hurting its overall accuracy. We also explore active learning techniques to determine the suitable size for a corpus of questions in order to achieve adequate accuracy while minimizing the annotation efforts.

in Proceedings of the international conference on language resources and evaluation, LREC 2010, 17-23 may 2010, valletta, malta, 2010, vol. 800, pp. 9–080

Automatic Annotation of the Catalan Wikipedia: Exploring the Semantic Space via multiple NERC systems
Jordi Atserias, Judith Domingo, Carlos Rodriguez, Teresa Suñol

This paper presents WikiNer, a snapshot of the Catalan Wikipedia processed with different NLP tools (POS tagger, NERC, dependency parsers). The article focuses on the analysis of different NERC annotations using 3 taggers: JNET, YamCha and SST. Although Wikipedia text (especially in tables, lists, references) differs significantly in distributional properties from the corpora used to train the taggers, we believe that results of automatically annotating the semantic space of the Catalan Wikipedia point to the quick availability of a resource containing massive text annotated with a degree of reliability that is enough for some research tasks as well as for applications, such as simple Q&A, ontology enrichment and semantic search.

Proces. del Leng. Natural, vol. 45, pp. 169–173, 2010

Annotated Search and Element Retrieval
Hugo Zaragoza, Michael Matthews, Roi Blanco, and Jordi Atserias

Despite the great interest in different forms of textual annotation (named entity extraction, semantic tagging, syntactic and semantic parsing, etc.), there is still no consensus about which search tasks can be improved with such annotations, and what search algorithms are required to implement efficient engines to solve these tasks. We define formally two retrieval tasks in annotated collections: annotated retrieval and element retrieval. We discuss their differences and describe efficient indexing structures, and how they can be implemented in Lucene and MG4J, two open source retrieval engines. Finally, we give a technical overview of two element retrieval use cases.

in Proceedings of first international workshop on living web, collocated with the 8th international semantic web conference (iswc-2009), washington, dc, usa, october 26, 2009, vol. 515

Semantically Annotated Snapshot of the English Wikipedia
J. Atserias, H. Zaragoza, M. Ciaramita, and G. Attardi

This paper describes SW1, the first version of a semantically annotated snapshot of the English Wikipedia. In recent years Wikipedia has become a valuable resource for both the Natural Language Processing (NLP) community and the Information Retrieval (IR) community. Although NLP technology for processing Wikipedia already exists, not all researchers and developers have the computational resources to process such a volume of information. Moreover, the use of different versions of Wikipedia processed differently might make it difficult to compare results. The aim of this work is to provide easy access to syntactic and semantic annotations for researchers of both NLP and IR communities by building a reference corpus to homogenize experiments and make results comparable. The resources are a semantically annotated corpus and an ’entity containment’ graph.

Complete and Consistent Annotation of WordNet using the Top Concept Ontology
Javier Álvez, Jordi Atserias, Jordi Carrera, Salvador Climent, Egoitz Laparra, Antoni Oliver and German Rigau

This paper presents the complete and consistent ontological annotation of the nominal part of WordNet. The annotation has been carried out using the semantic features defined in the EuroWordNet Top Concept Ontology and made available to the NLP community. Up to now only an initial core set of 1,024 synsets, the so-called Base Concepts, was ontologized in such a way. The work has been achieved by following a methodology based on an iterative and incremental expansion of the initial labeling through the hierarchy while setting inheritance blockage points. Since this labeling has been set on the EuroWordNet’s Interlingual Index (ILI), it can be also used to populate any other wordnet linked to it through a simple porting process. This feature-annotated WordNet is intended to be useful for a large number of semantic NLP tasks and for testing for the first time componential analysis on real environments. Moreover, the quantitative analysis of the work shows that more than 40% of the nominal part of WordNet is involved in structure errors or inadequacies.

in Proceedings of the international conference on language resources and evaluation, LREC 2008, 26 may - 1 june 2008, marrakech, morocco, 2008

Learning to tag and tagging to learn: A case study on wikipedia
Peter Mika, Massimiliano Ciaramita, Hugo Zaragoza and Jordi Atserias

The problem of semantically annotating Wikipedia inspires a novel method for dealing with domain and task adaptation of semantic taggers in cases where parallel text and metadata are available.

IEEE Intelligent Systems, vol. 23, no. 5, pp. 26–33, 2008.

PoS Tagging with a Named Entity Tagger
M. Ciaramita and J. Atserias

in Proceedings of the final evalita 2007 workshop, 2007.

Named Entity Tagging with a PoS Tagger
M. Ciaramita and J. Atserias

in Proceedings of the final evalita 2007 workshop, 2007.

PoS Tagging with a Named Entity Tagger
M. Ciaramita and J. Atserias

Intelligenza Artificiale, special issue (EVALITA 2007), vol. 2, 2007.

Named Entity Tagging with a PoS Tagger
M. Ciaramita and J. Atserias

Intelligenza Artificiale, special issue (EVALITA 2007), vol. 2, 2007.

World knowledge in broad-coverage information filtering
B. A. Hagedorn, M. Ciaramita, and J. Atserias

in SIGIR 2007: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, amsterdam, the netherlands, july 23-27, 2007, 2007, pp. 801–802

Ranking very many typed entities on wikipedia
Hugo Zaragoza, Henning Rode, Peter Mika, Jordi Atserias, Massimiliano Ciaramita, Giuseppe Attardi

We discuss the problem of ranking very many entities of different types. In particular we deal with a heterogeneous set of types, some being very generic and some very specific. We discuss two approaches for this problem: i) exploiting the entity containment graph and ii) using a Web search engine to compute entity relevance. We evaluate these approaches on the real task of ranking Wikipedia entities typed with a state-of-the-art named-entity tagger. Results show that both approaches can greatly increase the performance of methods based only on passage retrieval.

Towards Robustness in Natural Language Understanding
Jordi Atserias

Most of the different tasks included in Natural Language Processing (NLP) (such as Word Sense Disambiguation, Information Retrieval, Information Extraction, Question Answering, Information Filtering, Natural Language Interfaces, Story Understanding or Machine Translation) apply different levels of Natural Language Understanding (NLU). This thesis explores a new integrated architecture for robust NLU, exploiting constraint-based optimization techniques. The goal of this work is to move towards robust and flexible architectures able to deal with the complexity of advanced NLP. In particular, we present a novel architecture (PARDON), orthogonal to the traditional NLP task decomposition, which applies any kind of knowledge (syntactic, semantic, linguistic, statistical) at the earliest opportunity while retaining an independent representation of the different kinds of knowledge. The different architectures proposed for NLU can be classified based on two main dimensions, namely, the level of integration of their processes and the level of integration of their data. An easier modularization aimed at focusing on a particular NLP task, together with competitions (e.g. MUC, TREC, etc.), has led most researchers to adopt a pipelined or stratified architecture. However, this architecture shows several drawbacks which have made us consider the use of integrated and interactive approaches. In order to implement such approaches, we will also introduce the Consistent Labeling Problems (CLPs), a specific case of Constraint Satisfaction Problems that can be solved efficiently by a set of iterative algorithms (e.g. relaxation labeling). Constraints allow us to integrate both processes and knowledge in the same framework. On the one hand, many forms of ambiguity can be represented in a compact and elegant manner, and processed efficiently by means of constraints. On the other hand, many NLP processes (e.g., many WSD techniques) could also be represented as constraints.
Inside the PARDON architecture, an object uses its models to combine itself with other objects. During this combination, some of its attribute values are determined (in a similar way to Hirst’s Polaroid Words [Hirst, 1987]). Roughly speaking, PARDON combines objects from one level in order to build the objects corresponding to the next level of the task under consideration. This combination is carried out by using lexicalized models. That is, these models must be anchored in or triggered by a first-level object. PARDON represents the relationships between objects in a dependency-like style, with models and roles. In order to avoid the combinatorial explosion of possible object combinations, this framework is formalized as a Consistent Labeling Problem (CLP). Thus, it can be solved using optimization methods (e.g. the relaxation labeling algorithm) to find the most consistent solution. PARDON aims to give a general framework, multilingual and open domain, in which different NLP tasks can be easily formalized. These different tasks can be tested separately or carried out simultaneously following an integrated approach. Pursuing this goal, we have also integrated several resources in a multilingual knowledge base, named the Multilingual Central Repository (MCR). The MCR has been built around WordNet, using the EuroWordNet architecture. This multilingual repository integrates different resources: ontologies (SUMO, Top Concept Ontology), thematic classifications (Domains), local wordnets of different languages, and so on. The new architecture proposed by PARDON has been successfully applied to two different NLU tasks involved in Semantic Interpretation, namely Semantic Role Labeling (SRL) and Word Sense Disambiguation (WSD). Usually, Word Sense Disambiguation and Semantic Role Labeling are considered separately although they are strongly related. WSD can improve results in SRL (as different senses have different syntactic behaviours, especially verbs) and vice versa (e.g. using verbal preferences for WSD).
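The relaxation labeling algorithm mentioned in the abstract can be sketched generically. This is a textbook-style update rule under stated assumptions, not the thesis implementation: constraints are binary, compatibility values lie in [-1, 1], and the function name, data layout and the toy word senses are all hypothetical.

```python
def relaxation_labeling(labels, constraints, iterations=50):
    """Iteratively reweight candidate labels so that mutually compatible ones win.

    labels: dict mapping each variable to its list of candidate labels.
    constraints: list of (var_a, label_a, var_b, label_b, compatibility),
                 with compatibility in [-1, 1] (positive means the pair agrees).
    Returns a dict variable -> {label: weight}, weights summing to 1 per variable.
    """
    # Start from a uniform distribution over each variable's labels.
    weights = {v: {l: 1.0 / len(ls) for l in ls} for v, ls in labels.items()}
    for _ in range(iterations):
        support = {v: {l: 0.0 for l in ls} for v, ls in labels.items()}
        for va, la, vb, lb, r in constraints:
            # Each side of a constraint is supported by the current weight of the other.
            support[va][la] += r * weights[vb][lb]
            support[vb][lb] += r * weights[va][la]
        updated = {}
        for v, ls in labels.items():
            # Clip support to [-1, 1] so the multiplicative factor stays non-negative.
            raw = {
                l: weights[v][l] * (1.0 + max(-1.0, min(1.0, support[v][l])))
                for l in ls
            }
            total = sum(raw.values())
            updated[v] = {l: raw[l] / total for l in ls} if total > 0 else weights[v]
        weights = updated
    return weights


# Toy example: two ambiguous words whose senses constrain each other.
labels = {"bank": ["river", "finance"], "deposit": ["sediment", "money"]}
constraints = [
    ("bank", "finance", "deposit", "money", 0.8),
    ("bank", "river", "deposit", "sediment", 0.2),
]
result = relaxation_labeling(labels, constraints)
best = {v: max(ws, key=ws.get) for v, ws in result.items()}
```

In the toy run the stronger constraint pulls both words towards the financial reading at once, which mirrors the point of solving related labeling tasks in one framework instead of a pipeline.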

FreeLing 1.3: Syntactic and semantic services in an open-source NLP library
J. Atserias, B. Casas, E. Comelles, M. González, L. Padró, and M. Padró

This paper describes version 1.3 of the FreeLing suite of NLP tools. FreeLing was first released in February 2004 providing morphological analysis and PoS tagging for Catalan, Spanish, and English. From then on, the package has been improved and enlarged to cover more languages (i.e. Italian and Galician) and offer more services: Named entity recognition and classification, chunking, dependency parsing, and WordNet based semantic annotation. FreeLing is not conceived as an end-user oriented tool, but as a library on top of which powerful NLP applications can be developed. Nevertheless, sample interface programs are provided, which can be straightforwardly used as fast, flexible, and efficient corpus processing tools. A remarkable feature of FreeLing is that it is distributed under a free-software LGPL license, thus enabling any developer to adapt the package to his needs in order to get the most suitable behaviour for the application being developed.

in Proceedings of the fifth international conference on language resources and evaluation, LREC 2006, genoa, italy, may 22-28, 2006, 2006, pp. 2281–2286

Multiwords and Word Sense Disambiguation
Victoria Arranz, Jordi Atserias and Mauro Castillo

This paper studies the impact of multiword expressions on Word Sense Disambiguation (WSD). Several identification strategies of the multiwords in WordNet2.0 are tested in a real Senseval-3 task: the disambiguation of WordNet glosses. Although we have focused on Word Sense Disambiguation, the same techniques could be applied in more complex tasks, such as Information Retrieval or Question Answering.

in Computational linguistics and intelligent text processing, 6th international conference, cicling 2005, mexico city, mexico, february 13-19, 2005, proceedings, 2005, vol. 3406, pp. 250–262

Un Enfoque Integrado para la Desambiguación
Jordi Atserias

This paper presents an extension for WSD of an integrated architecture designed for Semantic Parsing. In the proposed framework, both tasks can be addressed simultaneously, collaborating with each other. The feasibility and robustness of the proposed architecture have been proved against a well-defined task on WSD (the SENSEVAL-II English Lexical Sample) using automatically acquired models.

in XXI congreso de la sociedad española para el procesamiento del lenguaje natural (sepln’05), 2005, pp. 179–186.

Artificial intelligence and computer science
Jordi Atserias

S. Shannon, Ed. Nova Science Publisher Inc., 2005, pp. 177–196.

TXALA un analizador libre de dependencias para el castellano
Jordi Atserias Batalla, Elisabet Comelles Pujadas and Aingeru Mayor

In this demo we present the first version of Txala, a dependency parser for Spanish developed under LGPL license. This parser is framed in the development of a free-software platform for Machine Translation. Due to the lack of this kind of syntactic parsers for Spanish, this tool is essential for the development of NLP in Spanish.

in XXI congreso de la sociedad española para el procesamiento del lenguaje natural (sepln'05), 2005, pp. 455–456.

An Integrated Approach to Word Sense Disambiguation
J. Atserias, L. Padró, and G. Rigau

in Recent advances in natural language processing (ranlp’05), 2005, pp. 82–88.

A Proposal for a Shallow Ontologization of Wordnet
Salvador Climent, Jordi Atserias Batalla, Joaquim Moré López and German Rigau Claramunt

This paper presents the work carried out towards the so-called shallow ontologization of WordNet, which is argued to be a way to overcome most of the many structural problems of the widely used lexical knowledge base. The result shall be a multilingual resource more suitable for large-scale semantic processing.

Proces. del Leng. Natural, vol. 35, pp. 161–167, 2005

The MEANING Multilingual Central Repository
Jordi Atserias, Luís Villarejo, German Rigau, Eneko Agirre, John Carroll, Bernardo Magnini and Piek Vossen

This paper describes the first version of the Multilingual Central Repository, a lexical knowledge base developed in the framework of the MEANING project. Currently the MCR integrates into the EuroWordNet framework five local wordnets (including four versions of the English WordNet from Princeton), an upgraded version of the EuroWordNet Top Concept ontology, the MultiWordNet Domains, the Suggested Upper Merged Ontology (SUMO) and hundreds of thousands of new semantic relations and properties automatically acquired from corpora. We believe that the resulting MCR will be the largest and richest Multilingual Lexical Knowledge Base in existence.

in 2nd International global wordnet conference (gwc'04), 2004, pp. 23-30.

Cross-Language Acquisition of Semantic Models for Verbal Predicates
Jordi Atserias, Bernardo Magnini, Octavian Popescu, Eneko Agirre, Aitziber Atutxa, German Rigau, John Carroll, Rob Koeling

This paper presents a semantic-driven methodology for the automatic acquisition of verbal models. Our approach relies strongly on the semantic generalizations allowed by already existing resources (e.g. Domain labels, Named Entity categories, concepts in the SUMO ontology, etc.). Several experiments have been carried out using comparable corpora in four languages (Italian, Spanish, Basque and English) and two domains (FINANCE and SPORT), showing that the semantic patterns acquired can be general enough to be ported from one language to another.

in 4th International conference on language resources and evaluation (lrec'04), 2004, pp. 33-36.

Towards the MEANING Top Ontology: Sources of Ontological Meaning
Jordi Atserias, Salvador Climent, German Rigau

This paper describes the initial research steps towards the Top Ontology for the Multilingual Central Repository (Mcr) built in the Meaning project. The current version of the Mcr integrates five local wordnets plus four versions of Princeton’s English WordNet, three ontologies and hundreds of thousands of new semantic relations and properties automatically acquired from corpora. In order to maintain compatibility among all these heterogeneous knowledge resources, it is fundamental to have a robust and advanced ontological support. This paper studies the mapping of main Sources of Ontological Meaning onto the wordnets and, in particular, the current work in mapping the EuroWordNet Top Concept Ontology.

in 4th International conference on language resources and evaluation (lrec’04), 2004, pp. 11–14.

Spanish WordNet 1.6: Porting the Spanish Wordnet across Princeton versions
Jordi Atserias, Luís Villarejo, German Rigau

This paper describes the new Spanish Wordnet aligned to Princeton WordNet1.6 and the analysis of the transformation from the previous version aligned to Princeton WordNet1.5. Although a mapping technology exists, to our knowledge it is the first time a whole local wordnet has been ported to a newer release of the Princeton WordNet.

in 4th International conference on language resources and evaluation (lrec’04), 2004, pp. 161–164.

The TALP Systems for Disambiguating WordNet Glosses
Mauro Castillo, Francis Real, Jordi Atserias and German Rigau

This paper presents a summary report on the empirical results obtained on the SENSEVAL-3 task 12 “Word-Sense Disambiguation of WordNet Glosses”. Our method combines a set of knowledge-based heuristics integrating several information resources and techniques. Of the ten systems presented at the task, our systems obtained the first and third positions.

Automatic Acquisition of Sense Examples Using ExRetriever
M. Cuadros, J. Atserias, M. Castillo, and G. Rigau

A current research line for word sense disambiguation (WSD) focuses on the use of supervised machine learning techniques. One of the drawbacks of using such techniques is that previously sense annotated data is required. This paper presents ExRetriever, a new software tool for automatically acquiring large sets of sense tagged examples from large collections of text and the Web. ExRetriever exploits the knowledge contained in large-scale knowledge bases (e.g., WordNet) to build complex queries, each of them characterising particular senses of a word. These examples can be used as training instances for supervised WSD algorithms.

in IBERAMIA workshop on lexical resources and the web for word sense disambiguation, 2004, pp. 97–104.

Automatic Acquisition of Sense Examples Using ExRetriever
J. Fernández, M. Castillo, G. Rigau, J. Atserias, and J. Turmo

A current research line for word sense disambiguation (WSD) focuses on the use of supervised machine learning techniques. One of the drawbacks of using such techniques is that previously sense annotated data is required. This paper presents ExRetriever, a new software tool for automatically acquiring large sets of sense tagged examples from large collections of text and the Web. ExRetriever exploits the knowledge contained in large-scale knowledge bases (e.g., WordNet) to build complex queries, each of them characterising particular senses of a word. These examples can be used as training instances for supervised WSD algorithms.

in Proceedings of the fourth international conference on language resources and evaluation, LREC 2004, May 26-28, Lisbon, Portugal, 2004

Integrating and porting Knowledge across Languages
Jordi Atserias, German Rigau and Luís Villarejo

in Recent advances in natural language processing (ranlp'03), 2003, pp. 31-37. ISBN 954-90906-6-3

Integrating Multiple Knowledge Sources for Robust Semantic Parsing
Jordi Atserias, Lluís Padró and Germán Rigau

This work explores a new robust approach for Semantic Parsing of unrestricted texts. Our approach considers Semantic Parsing as a Consistent Labelling Problem (CLP), allowing the integration of several knowledge types (syntactic and semantic) obtained from different sources (linguistic and statistical). The current implementation obtains 95% accuracy in model identification and 72% in case-role filling.

Semantic Analysis based on Verbal Subcategorization
J. Atserias, I. Castellón, M. Civit, and G. Rigau

in Conference on intelligent text processing and computational linguistics (cicling'00), 2000, pp. 330-340.

Using Diathesis for Semantic Parsing
J. Atserias, I. Castellón, M. Civit, and G. Rigau

in Venecia per il trattamento automatico delle lingue (vextal'99), 1999, pp. 385-392.

Morphosyntactic Analysis and Parsing of Unrestricted Spanish Text
Jordi Atserias i Batalla, Josep Carmona Vargas, Irene Castellón Masalles, Sergi Cervell, Montserrat Civit Torruella, Lluís Màrquez, María Antonia Martí Antonín, Lluís Padró Cirera, Roberto Placer, Horacio Rodríguez Hontoria, Mariona Taulé Delor, Jordi Turmo

This online demonstration presents an environment for massive processing of unrestricted Spanish text. The system consists of three stages: morphological analysis, POS disambiguation and parsing. The output of each can be pipelined into the next. The first two phases are described in (Carmona et al., 1998) and the third is described in (Atserias et al., 1998), both published in this conference. The execution may be performed inside the GATE environment, which enables visualization and analysis of intermediate results, or in the background, if higher efficiency is required for massive text processing.

Syntactic Parsing of Unrestricted Spanish Text
I. Castellón, M. Civit, and J. Atserias

This research focuses on the syntactic parsing of morphologically tagged corpora. A proposal for a corpus-oriented Spanish grammar is presented in this document. This work has been developed in the framework of the ITEM project and its main goal is to provide multilingual background for information extraction and retrieval tasks. The main goal of the Tacat analyser is to provide a way of obtaining large amounts of bracketed and parsed corpora, both general and domain-specific. Tacat uses context-free grammars and takes as input the categories of the Parole specification. The incremental methodology that we use allows us to recognise different levels of complexity in the analysis and to produce compatible outputs from all the grammars.

in 1st International conference on language resources and evaluation (lrec'98), 1998, pp. 603-609.

Combining Multiple Methods for the Automatic Construction of Multilingual WordNets
Jordi Atserias, Salvador Climent, Xavier Farreres, German Rigau and Horacio Rodríguez

This paper explores the automatic construction of a multilingual Lexical Knowledge Base from preexisting lexical resources. First, a set of automatic and complementary techniques for linking Spanish words collected from monolingual and bilingual MRDs to English WordNet synsets is described. Second, we show how the resulting data provided by each method are then combined to produce a preliminary version of a Spanish WordNet with an accuracy over 85%. Applying these combinations yields a 40% increase in the extracted connections without losing accuracy. Both coarse-grained (class level) and fine-grained (synset assignment level) confidence ratios are used and evaluated. Finally, the results for the whole process are presented.

in Recent advances in natural language processing (ranlp'97), 1997, pp. 143-149

Combining Unsupervised Lexical Knowledge Methods for Word Sense Disambiguation
German Rigau, Jordi Atserias, and Eneko Agirre

This paper presents a method to combine a set of unsupervised algorithms that can accurately disambiguate word senses in a large, completely untagged corpus. Although most of the techniques for word sense resolution have been presented as stand-alone, it is our belief that full-fledged lexical ambiguity resolution should combine several information sources and techniques. The set of techniques has been applied in a combined way to disambiguate the genus terms of two machine-readable dictionaries (MRD), enabling us to construct complete taxonomies for Spanish and French. Tested accuracy is above 80% overall and 95% for two-way ambiguous genus terms, showing that taxonomy building is not limited to structured dictionaries such as LDOCE.

in Joint 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics (ACL/EACL’97), 1997, pp. 48–55

Talks & Courses

Natural Language Processing and Text mining

Master in Business Analytics and Big Data. IE Business school. Jan-March 2015

Applications on Language Technologies Erasmus Mundus Language and Communication Technologies

Master (EM LCT) Erasmus Mundus Language and Communication Technologies

Strategies and resources for supervising Bachelor's and Master's final projects

UNI2013437. Continuing education for UB teaching staff, Institut de Ciències de l’Educació

Strategies and resources for supervising Bachelor's and Master's final projects

Scaling Up Natural Language Processing, Nov, 2012

NLP research at Yahoo! Barcelona
with Mike Matthews

Kyoto project 2nd Workshop, 2011

Natural Language, Named Entities and Social Media

1st Workshop of OpeNer, EU project, Sep, 2012, University of the Basque country (EHU)

Natural Language, Named Entities and Social Media

Natural Language and information Retrieval, Erasmus Programme - Business Staff Mobility, University of Pisa, June, 2012

UIMA, NLP environments and libraries

In recent years, various environments (GATE, Nooj, NLTK, UIMA) and libraries (openNLP, Freeling, Tanla) have appeared that make it possible to develop complex NLP modules and integrate them easily into applications. In this tutorial we will analyze the advantages and properties of some of these tools (GATE, UIMA, openNLP, Freeling, ...) and then focus in more depth on UIMA. UIMA (Unstructured Information Management Architecture) is a modular and flexible architecture capable of analyzing large volumes of unstructured information. Beyond the semantic search engine it already has, UIMA can use and explore other alternatives for semantic indexing (e.g., Lucene, MG4J), and it supports the easy construction of end applications (e.g., REST services or RDF CAS consumers).

Yahoo! research, Natural Language Retrieval Group

Seminar, Tractament Automàtic del Llenguatge (2009) GRIAL


EscoLab opens the doors of the country's leading laboratories and research centres and offers the opportunity to talk with the researchers working for the advancement of society. EscoLab offers 10,000 places in free scientific activities for students in ESO (compulsory secondary education), batxillerat (upper secondary education), and vocational training.

Festa de la ciencia

Festa de la Ciència (2009): The Science Festival offers a journey through time, from the origin of the universe and the formation of the celestial bodies to the appearance of life on Earth. A journey that pays tribute to Charles Darwin, Galileo Galilei, and Narcís Monturiol through exhibitions, itineraries, games, and shows.