2012 Luxembourg Workshop Report
 

MULTILINGUALWEB

Standards and best practices for the Multilingual Web

World Wide Web Consortium (W3C)   Directorate-General for Translation (DGT) of the European Commission

W3C Workshop Report:
The Multilingual Web – The Way Ahead
15 - 16 March 2012, Luxembourg

Today, the World Wide Web is fundamental to communication in all walks of life. As the share of English web pages decreases and that of other languages increases, it is vitally important to ensure the multilingual success of the World Wide Web.


The MultilingualWeb initiative is looking at best practices and standards related to all aspects of creating, localizing and deploying the Web multilingually. The project aims to raise the visibility of existing best practices and standards and identify gaps. The core vehicle for this is a series of four events planned over a two-year period.

On 15-16 March 2012 the W3C ran the fourth workshop in the series, in Luxembourg, entitled "The Multilingual Web – The Way Ahead". The Luxembourg workshop was hosted by the Directorate-General for Translation (DGT) of the European Commission. Piet Verleysen, European Commission - Resources Director, responsible for IT at the Directorate-General for Translation, gave a brief welcome address.

As for the previous workshops, the aim of this workshop was to survey, and introduce people to, currently available best practices and standards that are aimed at helping content creators, localizers, tools developers, and others meet the challenges of the multilingual Web. The key objective was to share information about existing initiatives and begin to identify gaps.

Like the workshop in Limerick, this event ran for one and a half days, and the final half day was dedicated to an Open Space discussion forum, in breakout sessions. Participants pooled ideas for discussion groups at the beginning of the morning, split into 7 breakout areas and reported back in a plenary session at the end of the morning. Participants could join whichever group they found interesting, and could switch groups at any point. During the reporting session participants in other groups could ask questions of or make comments about the findings of the group. This, once more, proved to be a popular part of the workshop. The final attendance count for the event was a little over 130.

As for previous workshops, we video-recorded the presenters and, with the assistance of VideoLectures, made the videos available on the Web, although we were unable to stream the content live. We also once more provided live IRC scribing to help people follow the workshop remotely and to assist participants in the workshop itself. As before, people tweeted about the conference and the speakers during the event, and you can see these tweets linked from the program page.

The program and attendees continued to reflect the same wide range of interests and subject areas as in previous workshops and we once again had good representation from industry (content and localization related) as well as research.

In what follows, after a short summary of key highlights and recommendations, you will find a short summary of each talk accompanied by a selection of key messages in bulleted list form. Links are also provided to the IRC transcript (taken by scribes during the meeting), video recordings of the talks (where available), and the talk slides. All talks lasted 15 minutes. Finally, there are summaries of the breakout session findings, most of which were provided by the participants themselves. We strongly recommend watching the videos, where available, since they are short but carry much more detail.

Contents: Summary, Welcome, Developers, Creators, Localizers, Machines, Users, Breakouts

Summary

Workshop sponsor: MultilingualWeb-LT. Video hosting: VideoLectures.

What follows is an analysis and synthesis of ideas brought out during the workshop. It is very high level, and you should watch or follow the individual speakers' talks to get a better understanding of the points made.

Our keynote speaker, Ivan Herman, described in an accessible way the various technologies involved in the Semantic Web and the kinds of use cases that they address. Although easy to follow, the talk was very comprehensive and gave a good idea of the current status of the various technologies. He then suggested some areas where the Semantic Web and the Multilingual Web communities could benefit each other. The Semantic Web has powerful technologies for categorizing knowledge, describing concepts, and interlinking information in different languages. These may help in binding and translating information across languages. The Multilingual Web community can offer advice on things such as how to describe the language of a literal or resource, how to refer to concepts across multilingual instantiations, and how to conceptualise the world across different cultures. It can also help address the problem of Internationalized Resource Identifier (IRI) equivalence.

In the Developers session we heard from Jan Anders Nelson about the importance of apps on Windows 8, and saw a demo of tools that Microsoft is making available to app developers that support localization into an impressive number of languages. We also heard that Microsoft is paying greater attention these days to linguistic sub-markets, such as Spanish in the USA. Tony Graham introduced attendees to XSL-FO in a very entertaining way, and went on to make the case that community support is needed to uncover the requirements for text layout and typographic support on the Web if XSL-FO is to serve users around the world. He raised the idea of setting up a W3C Community Group to address this, and called on people around the world to take action.

From Richard Ishida and Jirka Kosek we heard about various useful new features for multilingual text support being implemented or discussed in HTML5. These features are still being developed and participants are encouraged to participate in reviewing and discussing them. They include better support for bidirectional text in right-to-left languages, local annotation methods for Far-Eastern scripts (ruby), and the recent addition of a translate attribute, which allows you to specify what text should or should not be translated.

During the Creators session, Brian Teeman showed how the Joomla content management system has been improved to better support either translation or content adaptation. Joomla is a widely used authoring tool, and is largely supported by volunteer effort.

Loïc Dufresne de Virel talked about some of the problems Intel faced, and how they addressed them, and Gerard Meijssen discussed some of the issues involved in supporting the huge world of content in Wikipedia in over 300 languages. There are significant problems related to the availability of fonts and input mechanisms for many languages, but Gerard made a special point about improvements needed to CLDR's locale-specific data. Some issues relate to improving the CLDR interface, but like earlier speakers Gerard also called for wider participation from the public to supply the needed information.

The Localizers session began with an overview, from Spyridon Pilos, of the new MT@EC translation system that the European Commission is working on. In terms of standards gaps, he called for standard data models and structures for data storage and publication, so that less time is wasted on conversions. Matjaž Horvat then demonstrated a tool called Pontoon that is under development at Mozilla and allows translators to translate Web pages in situ.

The session ended with a talk from Dave Lewis about the MultilingualWeb-LT project, which started recently under the aegis of a W3C Working Group. He showed examples of the metadata that the group will be addressing, ranging across content creation, localization and consumption, as well as language technologies and resources. The group aims to produce a new version of the ITS standard that is relevant not only to XML, but also to HTML5 and content management systems. He called for public participation in refining the requirements for the project.

In the Machines session we had overviews of best practices in the European Publications Office, and the Monnet and Enrycher projects.

Peter Schmitz showed how and why the Publications Office, which deals with a huge quantity of content for the European Union, has based its approach on standards, and how it has also been involved in developing standard approaches to various aspects of the work, such as core metadata (a restricted shared set of metadata for each resource, based on Dublin Core, to enable global search), common authority tables (to harmonize metadata), an exchange protocol for EU legislative procedures (for interoperability), and the European Legislative Identifier (ELI) (also for interoperability).

Paul Buitelaar packed into his presentation an impressive amount of information about the Monnet project and some of the issues they are grappling with related to multilingual use of the Semantic Web. They are exploring various research topics related to ontologies and domain semantics.

Tadej Štajner discussed the approaches used by the Enrycher project to identify named entities in content (such as place names), so that they can be handled appropriately during translation. Future work will need to address re-use of language and semantic resources to improve performance on NLP tasks across different languages, and lower the barriers for using this technology for enriching content within a CMS.

The Users session began with a talk from Annette Marino and Ad Vermijs from the European Commission's translation organization about how seriously the Commission takes its commitment to a multilingual web presence.

Murhaf Hossari then followed with a description of a number of issues for people writing in right-to-left scripts where the Unicode Bidirectional Algorithm needs help. He called for changes to the algorithm itself to support these needs, rather than additional control characters or markup.

Nicoletta Calzolari introduced the new Multilingual Language Library, an LREC initiative to gather together all the linguistic knowledge the field is able to produce. The aim is to accumulate massive amounts of multi-dimensional data, which is key to fostering advancement in our knowledge about language and its mechanisms, in the way other sciences work, such as the Genome project. The information needs to be contributed and used by the community – requiring a change not only to technical tools, but more importantly to organizational thinking.

Fernando Serván brought the session and the first day to a close with a look into the recent experiences of the Food and Agriculture Organization of the United Nations. They have struggled with tools that require plain text, and with questions about how to address localization issues as their content goes out beyond the reach of their tools onto social media. He called for increased interoperability: each part of the process and each language resource has its own standard (TBX, TMX, XLIFF), but it is difficult to make them work together.

Over the course of the day we heard of many interesting initiatives where standards play a key role. There was a general concern for increasing and maintaining interoperability, and we heard repeated calls for greater public participation in initiatives to move forward the multilingual Web.

The second day was devoted to Open Space breakout sessions. The session topics were: MultilingualWeb-LT Conclusions; Semantic Resources and Machine Learning for Quality, Efficiency and Personalisation of Accessing Relevant Information over Language Borders; Speech Technologies for the Multilingual Web; MultilingualWeb, Linked Open Data & EC "Connecting Europe Facility"; Tools: issues, needs, trends; Multilingual Web Sites; and Language policy on multilingual websites.

The discussions produced a good number of diverse ideas and opinions, and these are summarized at the bottom of this report. There are also links to slides used by the group leaders and video to accompany them, which give more details about the findings of each of the breakout groups.

Welcome session & Keynote talk

Piet Verleysen, European Commission Resources Director, responsible for IT at the Directorate-General for Translation, welcomed the participants to Luxembourg and gave a brief speech encouraging people to come up with ideas that will make it easier to work with the Multilingual Web.

This was followed by a brief welcome from Kimmo Rossi, Project Officer for the MultilingualWeb project, and working at the European Commission, DG for Information Society and Media, Digital Content Directorate.

Ivan Herman, Semantic Web Activity Lead at the W3C, gave the keynote speech, entitled "What is Happening in the W3C Semantic Web Activity?". In this talk he gave, in a short time, an information-rich overview of the current work done at the W3C related to the Semantic Web, Linked Data, and related technical issues. The goal was not to give a detailed technical account but, rather, to give a general overview that could be used as a basis for further discussions on how that particular technology can be used for the general issue of the Multilingual Web. Most of the presentation described the aims and goals of the Semantic Web work and the various aspects of the technology, and described the status of each of these areas. Towards the end of the presentation, Ivan drew links between the Semantic Web and the Multilingual Web, proposing that each could help the other. The bullet points below include some of the areas where that might be the case.

  • The Web is full of data, not just documents; in fact, most of what is out there is data, from government data to information about restaurants and film reviews. More and more applications today rely on that data – it's their bread and butter. But we need to avoid silos of information. The real value of the Web is the links between information. Data needs to be linked as well, ie. integrated across the Web. The Semantic Web is a set of technologies with the goal of building such a Web of data and managing and exploiting the links.
  • The Semantic Web has powerful technologies to categorize knowledge (e.g., SKOS and other vocabulary standards) that can be used to help the Multilingual Web. These may help in "binding", translating, etc., information in different languages. Thesauri can be created with labels in different languages, and some level of knowledge extraction and analysis could be done on those (see the short example after this list).
  • Via Linked Data it is possible to interlink information in different languages. DBpedia, for example, integrates the various Wikipedia instances.
  • It is possible to tag texts using the same terms (e.g., via stable URIs).
  • Semantic Web technologies and practice have to consider the challenges of MLW. RDF has a very simple way of representing literals (copied from XML): a single language tag. Is it enough?
  • Ontologies/vocabularies are typically monolingual; terms are mostly English.
  • Practice of vocabulary design very often forgets about MLW issues (first name, last name...).
  • IRI equivalence is a major headache in practice.
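
To make the earlier points about multilingual labels and language tags concrete, here is a minimal, hypothetical sketch (the concept URI and labels are invented) of a SKOS concept whose labels carry RDF language tags, each literal holding a single BCP 47 tag:

    <!-- Hypothetical SKOS concept with language-tagged labels (RDF/XML). -->
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:skos="http://www.w3.org/2004/02/skos/core#">
      <skos:Concept rdf:about="http://example.org/concepts/cheese">
        <!-- Each literal carries one language tag. -->
        <skos:prefLabel xml:lang="en">cheese</skos:prefLabel>
        <skos:prefLabel xml:lang="fr">fromage</skos:prefLabel>
        <skos:prefLabel xml:lang="de">Käse</skos:prefLabel>
      </skos:Concept>
    </rdf:RDF>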

Developers session

The Developers session was chaired by Reinhard Schäler of the University of Limerick's Localisation Research Centre.

Jan Anders Nelson, Senior Program Manager Lead at Microsoft, started off the Developers session with a talk entitled "Support for Multilingual Windows Web Apps (WWA) and HTML 5 in Windows 8". The talk looked at support for web apps running on Windows 8 and stepped through the workflow a developer can follow to optimize the creation of multilingual applications, including how to organize app projects to support translation and the related tools available, which make creating multilingual apps easy for anyone considering shipping in more than one market, perhaps for the first time. Other significant remarks:

  • Apps are an important part of the Windows 8 operating system, and Microsoft wants to help developers localise into the 109 languages that Windows 8 supports.
  • Microsoft is now emphasizing support for linguistic markets within regional markets, such as Spanish speakers in the USA.
  • The Multilingual App Toolkit (MAT) provided by Microsoft gives developers multilingual resource support that enables them to focus on source language.
  • The tool provides a pseudo-translation function for identifying potential translation issues up front (such as hard-coded, concatenated, or truncated strings and other visual issues), an XLIFF output tool and editor, Microsoft Translator support for translators, and mechanisms for exporting and importing translations (see the illustrative XLIFF snippet below).
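
As an illustration of the kind of XLIFF output involved (a hypothetical sketch, not generated by the MAT itself; the file name, identifiers and strings are invented), a minimal XLIFF 1.2 file pairing a source string with its translation might look like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Hypothetical XLIFF 1.2 file with a single translation unit. -->
    <xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
      <file original="AppResources.resx" datatype="resx"
            source-language="en-US" target-language="es-ES">
        <body>
          <trans-unit id="GreetingText">
            <source>Welcome to the app!</source>
            <target state="translated">¡Bienvenido a la aplicación!</target>
          </trans-unit>
        </body>
      </file>
    </xliff>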

Tony Graham, Consultant at Mentea, gave a talk entitled "The XSL-FO meets the Tower of Babel". XSL Formatting Objects (XSL-FO) from the W3C is an XML vocabulary for specifying formatting semantics. While it shares many properties with CSS, it is most frequently used with paged media, such as formatting XML documents as PDF. XSL-FO 2.0 is currently under development, and one of its top-level requirements is further improved non-Western language support. However, the requirement for improved support in XSL-FO 2.0 is actually less specific than the 1998 requirements for XSL 1.0, since the W3C recognized that they didn't have the knowledge and expertise to match their ambitions. For that, they would need more help -- either from individual experts or from the W3C forming more task forces along the lines of the Japanese Layout Task Force to capture and distill expertise for use by all of the W3C and beyond. Other significant remarks:

  • XSL-FO has always been good at the "big picture" of multilingualism. For example, XSL-FO defines multiple writing modes to cover the requirements of just about every known script. XSL-FO also has margin-*, border-*, padding-*, and other properties much like CSS, but they are defined in terms of -before, -after, -start, and -end rather than fixed -top, -bottom, -left, and -right (see the short sketch after this list).
  • In contrast to the XSL 1.0 requirements, the 2008 XSL 2.0 requirements contain only a small amount of text related to internationalization, because the Working Group needs expertise to inform them of requirements.
  • Tony called for people to consider creating a Multilingual Layout Community Group at the W3C, which could discuss layout requirements for various scripts and spin off task forces within the Internationalization Activity to develop the detailed requirements themselves (in the style of the Japanese Layout Task Force and its Requirements for Japanese Layout document).
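
To illustrate the writing modes and start/end-relative properties mentioned above, here is a minimal, hypothetical XSL-FO fragment (not taken from the talk; the content and values are invented):

    <!-- Hypothetical XSL-FO fragment: the same start-side padding works for
         both left-to-right and right-to-left text, because "start" follows
         the writing mode rather than a fixed physical side. -->
    <fo:block-container writing-mode="rl-tb"
                        xmlns:fo="http://www.w3.org/1999/XSL/Format">
      <fo:block padding-start="6pt" space-before="3pt" language="ar">
        ...Arabic content here...
      </fo:block>
    </fo:block-container>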

Richard Ishida, Internationalization Lead at the W3C, and Jirka Kosek, XML Guru at the University of Economics, Prague, co-presented "HTML5 i18n: A report from the front line". The talk briefed attendees on developments related to some key markup changes in HTML5. In addition to new markup to support bidirectional text (eg. in scripts such as Arabic, Hebrew and Thaana), the Internationalization Working Group at the W3C has been proposing changes to the current HTML5 ruby markup model. Ruby annotations are used in Japanese and Chinese to help readers recognize and understand ideographic characters. They are commonly used in educational material and manga, and can help with accessibility. The Working Group has also proposed the addition of a flag to indicate when text in a page should not be translated. This talk delivered an up-to-the-minute status on the progress made in these areas. Other significant remarks:

  • HTML5 now includes support for a bdi element and an auto value for the dir attribute that help estimate direction of phrases and isolate those phrases from the surrounding text. This helps address some of the current issues with bidi support in HTML, especially for text that is injected into a document at run time.
  • An updated version of the Japanese Layout Requirements document was published on 3 April. This is an invaluable resource for specification developers and implementers wanting to support Japanese.
  • Richard described a number of questions surrounding the markup for ruby support in HTML5 that are currently being discussed.
  • The HTML5 Working Draft recently added a translate attribute that allows you to identify text that should or should not be translated, whether by humans or by machine translation. It is already supported by online translation services from Microsoft and Google and by content formats such as DITA and DocBook (see the markup example below).
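
The following is a small, hypothetical sketch of how these features look in markup (the text content and code are invented for illustration):

    <!-- dir="auto" lets the browser detect the base direction of injected text;
         bdi isolates a user name of unknown direction; translate="no" flags
         text that translation tools should leave alone. -->
    <p dir="auto">Posted by <bdi>اسم المستخدم</bdi> on 15 March 2012</p>
    <p>Activate the product with the code <span translate="no">WIN8-PRO-2012</span>.</p>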

The Developers session on the first day ended with a Q&A period with questions about Indic layout, use of CSS and translate flags, applicability of new HTML5 features to other formats, fine-grained identification of language variants for localization with the MAT tool, extensions to the translate flag, and formats supported by the MAT tool. For more details, see the related links.

Creators session

This session was chaired by Jan Nelson of Microsoft.

Brian Teeman, Director, School of Joomla!, JoomlaShack University, gave the first talk of the Creators session, "Building Multi-Lingual Web Sites with Joomla! the leading open source CMS". Joomla is the leading open source CMS, used by over 2.8% of the web and by over 3000 government web sites. With the release in January 2012 of the latest version (2.5), building truly multilingual web sites has never been easier. This presentation showed how easy it is to build real multilingual web sites without relying on automated translation tools. Other significant remarks:

  • A problem with machine translation of content is that it doesn't get indexed by search engines.
  • A content management system, such as Joomla, needs to allow authors to provide translated versions of pages, or to author locally adapted content for a site.
  • Joomla uses categorisation of articles to allow users to switch languages across locally adapted content sites and still arrive at a page which is the same or is related to the page they switched from.

Loïc Dufresne de Virel, Localization Strategist at Intel Corp., spoke on "How standards (could) support a more efficient web localization process by making CMS - TMS integrations less complicated". As Intel had just deployed a new Web Content Management system, which they integrated with their TMS, they had to deal with multiple challenges, and also faced a great deal of complexity and customization. In this 15-minute talk, Loïc looked into what Intel did well, what they did wrong, and what they could have done better, and attempted to put a dollar figure on the cost of ignoring a few standards. Other significant remarks:

  • What would it take to get any CMS to "communicate" efficiently with any TMS? Do we need new standards to achieve this? Not really, in Loïc's opinion, but we do need a change in behavior, and we need to put the existing standards to better use.

Gerard Meijssen, Internationalization/Localization Outreach Consultant at the Wikimedia Foundation, presented "Translation and localisation in 300+ languages ... with volunteers". There are over 270 Wikipedias, and over 30 languages are requesting one. These languages represent most scripts, and both small and large populations. The Wikimedia Foundation enables the visibility of text with web fonts, and it supports input methods. There is a big multi-application localization platform at translatewiki.net, and the Foundation is implementing translation tools for its "pages" for documentation and communication to its communities. To do this, they rely on standards. Standards gain relevance as they are implemented in more and more places in the software. Some standards don't support all the languages that the Wikimedia Foundation supports. Other significant remarks:

  • We would like to see Europeana localized into languages other than just the European ones.
  • Not all languages have quality, freely-licensed Unicode font support. The Wikimedia Foundation is working with Mozilla and Red Hat on the development of fonts for Indian languages.
  • A solution must also be found for keyboard input, and the Foundation has developed some software to help with that, called Narayam.
  • The Wikimedia Foundation supports more languages than CLDR does, and the CLDR information is often not accurate (even for English).
  • One thing that stops CLDR growing faster is that it is truly hard to enter the right data – even if you are using trained linguists as developers. Another problem is that you have to stop submitting data at various points, while the Unicode Consortium evaluates the data received by a given cut-off point.
  • We need to be able to identify languages in such a way that spell-checkers and similar tools know how to behave, and so that search engines can find text in a specific language – and the language may be a variant or an unusual language (see the language-tagging example after this list).
  • We are looking for a more humane face on the CLDR input process and for ways to get more people involved in contributing CLDR data. If you know people who can help better support languages on the Web, talk to us and to the W3C.
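
As a small, hypothetical illustration of the kind of language identification mentioned above (the sentences are invented), BCP 47 language tags used in HTML's lang attribute can identify regional and script variants as well as base languages:

    <!-- Language tags can identify regional or script variants, not just languages. -->
    <p lang="es">Texto en español.</p>          <!-- Spanish -->
    <p lang="es-419">Texto en español.</p>      <!-- Latin American Spanish -->
    <p lang="sr-Latn">Tekst na srpskom.</p>     <!-- Serbian written in Latin script -->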

The Q&A part of the Creators session included questions about support for users of Wikipedia in multiple languages, which standard Wikipedia uses for language tagging, how to address the need for a more standard approach to content development in general, how to motivate people to localize for free, and whether Joomla supports styling changes at the point of translation. For more details, see the related links.

Localizers session

This session was chaired by Arle Lommel of GALA and DFKI.

Spyridon Pilos, Head of Language Applications Sector in the Directorate General for Translation at the European Commission, talked about "The Machine Translation Service of the European Commission". The Directorate General for Translation (DGT) has been developing, since October 2010, a new data-driven machine translation service for the European Commission. MT@EC should be operational in the second semester of 2013. One of the key requirements is for the service to be flexible and open: it should enable, on one hand, the use of any type of language resource and any type of MT technology and, on the other, facilitate easy access by any client (individual or service). Spyridon presented the approach taken and highlighted problems identified, as pointers to broader needs that should be addressed. Other significant remarks:

  • Each web site has its own way of organizing multilingual resources, because there is no standard. This makes it difficult to interface with different environments using the same tools.
  • There is no standard multilingual packaging for linguistic resources.
  • We need something that makes it easier for us to get multilingual information from the Web and put multilingual information out to the Web, ie. standard data models, and structures for data storage and publication, so that we don't waste time on conversions.
  • The Commission is also a founding member of Linport (Language Interoperability Portfolio) which is looking at standard ways of exchanging information.
  • But we are also ready to change, and we are exploring XLIFF at the moment. We are ready to participate as major users and 'testers', and also to propose ideas about how a large organization needs to work with data.

Matjaž Horvat, L10n driver at Mozilla, talked about "Live website localization". He reported on Pontoon, a live website localization tool developed at Mozilla. Instead of extracting website strings and then merging translated strings back, Pontoon can turn any website into editable mode. This enables localizers to translate websites in-place and also provides context and spatial limitations. At the end of the presentation, Matjaž ran through a demo. You can see a limited version of the demo page.

David Lewis, Research Lecturer at the Centre for Next Generation Localisation at Trinity College Dublin, spoke about "Meta-data interoperability between CMS, localization and machine translation: Use Cases and Technical Challenges". January 2012 saw the kick-off of the MLW-LT ("MultilingualWeb - Language Technologies") Working Group at the W3C as part of the Internationalization Activity. This WG will define metadata for web content (mainly HTML5) and "deep Web" content (CMS or XML files from which HTML pages are generated) that facilitates content interaction with multilingual language technologies such as machine translation, and localization processes. The Working Group brings together localization and content management companies with content and language metadata research expertise, including strong representation from the Centre for Next Generation Localisation. This talk presented three concrete business use cases that span CMS, localization and machine translation functions. It discussed the challenges in addressing these cases with existing metadata (e.g. ITS tags) and the technical requirements for additional standardised metadata. (This talk was complemented by a breakout session to allow attendees to voice their comments and requirements in more detail, in order to better inform the working group. See below.) Other significant remarks:

  • The MultilingualWeb-LT Working Group will aim to extend ITS to cover additional data categories, and to make it relevant to HTML5 and to metadata embedded in CMS applications. It will aim to use existing approaches, such as microdata or RDFa, rather than invent new markup (see the small ITS example after this list).
  • It's important to look at everything between the content author and the end user.
  • We need additional people to step up and feed into these requirements.
  • While MultilingualWeb-LT is moving forward, people should be thinking about what to tackle next. There are many other things, such as multimodal interaction, captioning and audio-visual content, localizing JavaScript, etc. We need to start having conversations about these areas, and discuss who needs to be involved to work on them.
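
For readers unfamiliar with ITS, here is a minimal ITS 1.0 rules fragment of the kind the group plans to build on (a hypothetical sketch; the XPath selector is invented). It flags all code elements in a document as not for translation:

    <!-- Hypothetical ITS 1.0 rules document: elements matched by the XPath
         selector are marked as not translatable. -->
    <its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
      <its:translateRule selector="//code" translate="no"/>
    </its:rules>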

During the Q&A session questions were raised about crowd-sourcing issues for the Pontoon approach, and how extensible the Pontoon approach is for wider usage. There was some discussion of open data policies related to the MT@EC system and opportunities for collaborative work (eg. with Adobe), as well as whether it makes sense to control source text for machine translation. For details, see the related links.

Machines session

This session was chaired by Felix Sasaki of DFKI.

Peter Schmitz, Head of Unit "Enterprise Architecture" at the European Commission, talked about "Common Access to EU Information based on semantic technology". The Publications Office is setting up a common repository to make available in a single place all metadata and digital content related to public official EU information (law and publications) in a harmonised and standardised way, in order to guarantee citizens better access to the law and publications of the European Union, and to encourage and facilitate reuse of content and metadata by professionals and experts. The common repository is based on semantic technology. At least all official languages of the EU are supported by the system, so the system is a practical example of a multilingual system accessible through the Web. Other significant remarks:

  • The goal is to make available at a single place all metadata and digital content managed by the Publications Office in a harmonised and standardised way.
  • The project has adopted a number of standards to increase interoperability. These include METS (the Metadata Encoding and Transmission Standard, used as the ingestion protocol), Dublin Core (core metadata definitions), FRBR (data model/ontology), Linked Open Data (LOD) (access/reuse), and SPARQL (access/reuse).
  • But the project is also involved in the definition of standards, in the following areas: Core metadata (restricted shared set of metadata for each resource based on Dublin Core, to enable global search), Common authority tables (harmonize metadata), Exchange protocol for EU legislative procedures (interoperability).
  • Also under preparation is the European Legislative Identifier (ELI) (for interoperability).

Paul Buitelaar, Senior Research Fellow at the National University of Ireland, Galway, spoke about "Ontology Lexicalisation and Localisation for the Multilingual Semantic Web". Although knowledge processing on the Semantic Web is inherently language-independent, human interaction with semantically structured and linked data will be text or speech based – in multiple languages. Semantic Web development is therefore increasingly concerned with issues in multilingual querying and rendering of web knowledge and linked data. The Monnet project on 'Multilingual Ontologies for Networked Knowledge' provides solutions for this by offering methods for lexicalizing and translating knowledge structures, such as ontologies and linked data vocabularies. The talk discussed challenges and solutions in ontology lexicalization and translation (localization) by way of several use cases that are under development in the context of the Monnet project. Other significant remarks:

  • The project is about translating specific labels for concepts in the context of ontologies; some of these are exact matches, some are narrow matches.
  • Another goal is research: can we exploit the domain semantics inherent in the ontology to improve machine translation?
  • Ontological lexicalization use cases include ontology localization and verbalisation, ontology-based information extraction, ontology learning, etc.

Tadej Štajner, Researcher at the Jožef Stefan Institute, gave a talk about "Cross-lingual named entity disambiguation for concept translation". The talk focused on experience at the Jožef Stefan Institute in developing an integrated natural language processing pipeline, consisting of several distinct components, operating across multiple languages. He demonstrated a cross-language information retrieval method that enables reuse of the same language resources across languages, by using a knowledge base in one language to disambiguate named entities in text written in another language, as developed in the Enrycher system (enrycher.ijs.si). He discussed the architectural implications of this capability for development practices, and its prospects as a tool for automated translation of specific concepts and phrases in a content management system. Other significant remarks:

  • The HTML5 translate attribute helps to prevent translation of names, but someone has to mark up all the names. The goal of Enrycher is to automatically identify entities such as names in content so that they can be annotated (see the markup sketch after this list).
  • Various algorithms are used to achieve this, although the success is language-dependent. Another issue is that in some languages sources of contextual information are missing or sparse, so additional algorithms are used to overcome this.
  • Future work needs to (a) re-use language and semantic resources to improve performance on NLP tasks across different languages, and (b) lower the barriers for using this technology for enriching content within a CMS.
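
A sketch of the kind of annotation this enables (hypothetical output, not Enrycher's actual markup): once an entity such as a person name is identified automatically, it can be flagged so that translation tools leave it intact:

    <!-- Hypothetical result of automatic entity annotation: the name is
         marked so machine translation does not attempt to translate it. -->
    <p>The Enrycher system was presented by
       <span translate="no">Tadej Štajner</span> at the workshop.</p>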

Topics discussed during the Q&A session included whether the CELLAR project shares data with other data sets, such as government data; how domain lexicon generation is done in Monnet; how Enrycher name disambiguation deals with city names that occur many times in one country; and how machine translation relates to the identification of language-neutral entities in Monnet and Enrycher. For the details, follow the related links.

Users session

This session was chaired by Tadej Štajner of the Jožef Stefan Institute.

Annette Marino, Head of the Web Translation Unit at the European Commission, and Ad Vermijs, DGT Translator, started the Users session with a talk entitled "Web translation, public service & participation". For most Europeans, the internet provides the only chance they have for direct contact with the EU. But how can the Commission possibly inform, communicate and interact with the public if it doesn't speak their language on the web? With the recent launch of the European Citizens' Initiative website in 23 languages, there's no doubting the role of web translation in participatory democracy, or the Commission's commitment to a multilingual web presence. But as well as enthusiasm, they need understanding – of how people use websites and social media, and what they want – so that they can make best use of translation resources to serve real needs. As the internet evolves, the Commission is on a steep learning curve, working to keep up with the possibilities – and pitfalls – of web communication in a wide range of languages. Other significant remarks:

  • For most Europeans, the internet is the easiest way to get information from the authorities, or to interact with them; for many, it is the only opportunity they have for direct contact with those who govern. For organizations like the EU institutions, the importance of good web communication therefore cannot be overestimated.
  • If a DG plans a multilingual web project, the sooner the web unit is involved, the better it is.
  • Translating EU news requires the local perspective to be taken into account! On the day of the European museum awards, UK readers probably prefer to read about a UK museum, whereas it's more interesting for Poles to read about an award-winning museum in their own country. Machine translation can do a lot, but it can't replace the name of a London museum with that of a Polish museum in a news article.

Murhaf Hossari, Localization Engineer at University College Dublin, talked about "Localizing to right-to-left languages: main issues and best practices". Internationalization and localization efforts need to take extra care when dealing with right-to-left languages due to specific features those languages have, and many localization issues are specific to right-to-left languages. The talk categorized the issues that face localizers when dealing with right-to-left languages, with a special focus on text direction and the handling of bidirectionality. The talk also mentioned best practices and areas for improvement. Other significant remarks:

  • Use relative layouts and avoid hard coding for user interfaces to avoid problems with right-to-left versions.
  • The Unicode Bidirectional Algorithm has issues representing some bidirectional text on the Web: for example, paragraph direction isn't always detected correctly, and there are problems with the positioning of weak characters (see the short example after this list).
  • Using Unicode directional control characters is not a good solution because they have to be added manually, they aren't trivial to use, they are invisible, and need to be checked at runtime. We need to try to detect patterns of behavior here and add new logic to the Unicode Bidirectional Algorithm itself, rather than adding control characters or markup to content.
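
A hypothetical illustration of the paragraph-direction problem mentioned above (the text is invented): first-strong detection chooses the base direction from the first strongly typed character, so an Arabic sentence that happens to begin with a Latin-script word is given a left-to-right base direction:

    <!-- The first strong character is Latin ("W3C"), so automatic detection
         gives this mostly-Arabic paragraph a left-to-right base direction,
         and alignment and punctuation placement come out wrong for Arabic readers. -->
    <p dir="auto">W3C مرحبا بالعالم!</p>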

Nicoletta Calzolari, Director of Research at CNR-ILC, talked about "The Multilingual Language Library". The Language Library is quite a new initiative – started with LREC 2012 – conceived as a facility for gathering and making available, through simple functionalities, all the linguistic knowledge the field is able to produce, putting in place new ways of collaboration within the Language Technology community. Its main characteristic is that it is collaboratively built, with the entire community providing and enriching language resources by annotating and processing language data and freely using them. The aim is to exploit today's trend towards sharing to initiate a collective movement that also works towards creating synergies and harmonization among different annotation efforts that are now dispersed. The Language Library could be considered the beginning of a big Genome project for languages, where the community will collectively deposit and create increasingly rich and multi-layered linguistic resources, enabling a deeper understanding of the complex relations between different annotation layers. Other significant remarks:

  • Sharing language resources and tools via the META-Share initiative is a big step forward, but it's not enough. We need a paradigm shift towards Collaborative iResources, ie. resources that we build in a collaborative way. We need thousands of people working together on the same big project, like in other advanced sciences.
  • Accumulation of massive amounts of multi-dimensional data is the key to foster advancement in our knowledge about language & its mechanisms.
  • The key issues are not technical, but organizational. This means a change of mentality: going beyond "my approach" to some "compromise" that allows big returns through synergy.

Fernando Serván, Senior Programme Officer at the Food and Agriculture Organization of the United Nations, talked about "Repurposing language resources for multilingual websites". This presentation addressed lessons learned regarding the re-use of language resources (translation memories in TMX, terminology databases in TBX) to improve content and language versions of web pages. It also addressed the need for better integration between existing standards, the need for interoperability, and the areas where standards and best practices could help organizations with a multilingual mandate. Other significant remarks:

  • As we expand our presence into social media, such as Facebook and Twitter, we are unable to use some of the technology we have developed in-house, and are encountering some obstacles to multilingual deployment.
  • The FAO found that the MT systems they were using operated suboptimally with marked-up text, so they have to reduce all their content to plain text. This means a loss of information from the richer contexts, and difficulties applying things like the new translate attribute in HTML5.
  • Documentation was also a challenge: technical documentation is not summarized and it is complex (with the exception of Moses for Mere Mortals).
  • Another difficulty relates to interoperability: each part of the process and each language resource has its own standard (TBX, TMX, XLIFF), but it is difficult to make them work together (see the illustrative TMX snippet below).
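
For illustration, a minimal TMX translation unit of the kind re-used here might look like the following (a hypothetical sketch; the tool name and segments are invented):

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Hypothetical TMX 1.4 translation memory with one English-French unit. -->
    <tmx version="1.4">
      <header creationtool="example-tool" creationtoolversion="1.0"
              segtype="sentence" o-tmf="plain" adminlang="en"
              srclang="en" datatype="plaintext"/>
      <body>
        <tu>
          <tuv xml:lang="en"><seg>Food security</seg></tuv>
          <tuv xml:lang="fr"><seg>Sécurité alimentaire</seg></tuv>
        </tu>
      </body>
    </tmx>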

The Q&A, session dealt with questions about whether the data in the Language Library is freely licensed, and what that means, and how to assess the quality of the data that is being gathered. Other questions: whether the DGT is translating social media content, and how the FAO addresses quality issues with MT and whether they use SKOS. For the details, follow the related links.

Discussion sessions

This session was chaired by Jaap van der Meer of TAUS.

Workshop participants were asked to suggest topics for discussion on small pieces of paper that were then stuck on a set of whiteboards. Jaap then led the group in grouping the ideas and selecting a number of topics for breakout sessions. People voted for the discussion they wanted to participate in, and a group chair was chosen to facilitate each. The participants then separated into breakout areas for the discussion, and near the end of the workshop met together again in plenary to discuss the findings of each group. Participants were able to move between breakout groups.

At Luxembourg we split into the following groups:

  1. MultilingualWeb-LT Conclusions, led by Dave Lewis
  2. Semantic Resources and Machine Learning for Quality, Efficiency and Personalisation of Accessing Relevant Information over Language Borders, led by Timo Honkela
  3. Speech Technologies for the Multilingual Web, led by Anabella Barreiro
  4. MultilingualWeb, Linked Open Data & EC "Connecting Europe Facility", led by Christian Lieske
  5. Tools: issues, needs, trends, led by Elena Rudeshko
  6. Multilingual Web Sites, led by Manuel Tomas Carrasco Benitez
  7. Language policy on multilingual websites, led by Konrad Fuhrman

Summaries of the findings of these groups are provided below, some of which have been contributed by the breakout group chair.

MultilingualWeb-LT Conclusions

This session was facilitated by David Lewis, one of the co-chairs of the new MultilingualWeb-Language Technology (MultilingualWeb-LT) W3C Working Group. He introduced MultilingualWeb-LT as the successor to the ITS standard, but with the aim of better integrating localization, machine translation and text analytics technologies with content management. Therefore, the standardization of metadata for interoperability between processes in content creation, through localization/translation, to publication is now in scope. MultilingualWeb-LT would retain successful features of ITS, including support for current ITS data categories, the separation of data category definitions from their mapping to specific implementations, and the independence of data categories from one another, so that conformance can be claimed by implementing any one of them. It would, however, address implementation in HTML5, using existing metadata annotation approaches such as RDFa and microdata, as well as round-trip interoperability with XLIFF.

The session focussed largely on the general characteristics of data categories rather than on specific data categories: for example, whether a category is generated automatically or manually (or both), whether it affects the structure of an HTML document, and how it interacts with other metadata processing, e.g. for style sheets or accessibility. One specific data category to emerge was the priority assigned to content for translation. A point that repeatedly emerged was the importance of relating metadata definitions to specific processes. This would influence how metadata for processing provenance and processing instructions would be represented, and was also important in defining processing expectations. The challenge identified here was that process flow definitions are intimately linked to localization contracts and service level agreements. Therefore, trying to standardize process models could meet resistance due to the homogenizing effect on business models and service offerings. CMS-based workflows that are contained within a single content-creating organization and assembled from multiple LT components, rather than outsourced to LSPs, may be more accepting of a common process definition. It was recognized, however, that language technologies, including machine translation and crowd-sourcing, may mean that many process boundaries will change, quickly dating any standardised process model.

Semantic Resources and Machine Learning for Quality, Efficiency and Personalisation of Accessing Relevant Information over Language Borders

No short summary of this session was provided. Please see the slides and video links.

Speech Technologies for the Multilingual Web

The web is becoming more vocal, and speech content is increasingly being brought onto the Internet in a variety of forms: e-learning classes, motivational talks, and broadcast news captioning, among others. These contents strengthen information access and make it more appealing and dynamic. The Discussion Group on Speech Technologies for the Multilingual Web concluded that speech technology needs to be discussed, and that best practices and standards are needed to address current gaps in the web's speech content. The group pointed out the socioeconomic value of spoken language in enabling more efficient human communication, helping businesses advertise, market and sell their products and services, addressing educational inclusion and special communication needs, and creating new opportunities to spread and share knowledge on a global scale.

MultilingualWeb, Linked Open Data & EC "Connecting Europe Facility"

Kimmo Rossi (European Commission; EC) and Christian Lieske (SAP AG) proposed to look into the status quo and possible actions concerning the intersection of the MultilingualWeb Thematic Network, Linked Open Data and the EC's "Connecting Europe Facility". The breakout attracted attendees from constituencies such as users/service requesters (e.g. users and implementers of machine translation systems), facilitators (e.g. the EC), service providers (e.g. for translation-related services), and enablers (the World Wide Web Consortium, W3C). Outcomes of the breakout included a first step towards a mutual understanding of the topic and subtopics, information on actions that have already been started, and suggestions for follow-ups.

The general question was: how should work started in the MultilingualWeb (MLW) Network continue with a view towards Linked Open Data (LOD) and the Connecting Europe Facility (CEF)? Currently, it is still hard to find good, freely (re)usable NLP resources (MT systems, rules for them, dictionaries, etc.) as LOD on the Web. In addition, a change of paradigm for data creation is needed; currently it is hard to have a unified "picture" across languages (for example, the "population of Amsterdam" differs across Wikipedia language versions). A new, language-technology-supported paradigm, pursued e.g. in the Monnet project, could be to create resources in one language, or even in a language-agnostic/neutral form, and to use (automatic) approaches to generate resources across languages.

The attendees committed to some immediate action items: making the relationship between MLW, Linked Open Data and the CEF easy to understand for everyone; gathering more use cases for language resources and language-related services; and holding further discussions about this topic at dedicated events, such as the upcoming MultilingualWeb-LT workshop (11 June, Dublin). Further information about the session is available in an extended session report.

Tools: issues, needs, trends

Tools development for language services plays an important role in bringing innovation and simplifying everyday work. It therefore requires attention from all stakeholders: tool developers and tool users at different levels, including customers and vendors. However, the process is not smooth. The issues identified during the discussion were:

  1. developers confirm that it is difficult to find good software developers with a good level of English for global teams
  2. developers confirm that feedback from tool end users is important for creating more effective processes for the language services industry
  3. translators confirm that, due to differing standards and new tools, they must continually learn
  4. most of the group agreed that effective standardization is required
  5. to improve on the items above, it is important to communicate with all L10N stakeholders
  6. in most European countries, the level of education related to language services industry needs is still insufficient.

Multilingual Web Sites

The main conclusions are to get organized and produce standards.

One approach is to create a Multilingual Web Sites Community Group at the W3C; in the meantime, this has been done.

The starting point is the document Open architecture for multilingual web sites. The final document might be quite different; this document also serves as a primer.

Following a preliminary exchange, there was a round-robin where each participant explained his/her point of view. By the end, it seemed that the participants were describing the same object from different angles and that this break-out group could only scratch the surface of the problem.

Tomas Carrasco filled in some background information, in particular the points of view of the user and the webmaster, and techniques such as Transparent Content Negotiation (TCN; see the sketch after the list below). Multilingual web sites (MWS) have two main stakeholders:

  1. the User, ie. the general public: the end users viewing web sites. Here one has to be very conservative and changes must be very gradual; new techniques take a long time to propagate, typically five to ten years. Users should expect different MWSs to behave in a similar fashion. To most users, an MWS is monolingual; selecting languages and similar steps are barriers to getting to the information in the desired language.
  2. the Webmaster, ie. all the participants in the creation of an MWS: authors, translators and publishers (the ATP-chain). "Multilingual" is really the point of view of the webmaster. The interoperability among the tools in the ATP-chain is a key factor. New techniques should take less than three years to propagate.
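
As a simplified, hypothetical sketch of HTTP language negotiation in the spirit of the TCN technique mentioned above (the host, path and header values are invented): the browser states the user's language preferences and the server returns the best available language variant, so most users never need to use a language selector:

    GET /news/press-release HTTP/1.1
    Host: example.org
    Accept-Language: fr-BE, fr;q=0.8, en;q=0.5

    HTTP/1.1 200 OK
    Content-Language: fr
    Content-Location: press-release.fr.html
    Vary: Accept-Language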

Language policy on multilingual websites

One topic that was generally under-represented in the talks during the 4th MultilingualWeb workshop, "The Way Ahead", was the debate about multilingual content itself and how it is handled: what does one present, in which language, and for whom? How does one avoid localization that the user relies on but that undermines their confidence in the content translated into their language? The Open Space discussions that took place on 16 March provided an opportunity to discuss these questions.

The group, composed of participants from other European institutions, the FAO and Lionbridge, arrived at the following conclusions:

  1. Problems:
    1. The point of view of the user/client is often lacking (as, for example, at this workshop, where there were no representatives of consumer associations, etc.).
    2. Things are left in trust: others decide what needs to be translated, updated and how it should be localized.
    3. Often the linguistic policy of sites is not explicit.
    4. The network is linguistically fragmented: for example, search engines don't find information presented in other languages.
  2. Solutions:
    1. Inform users about content that is missing in a given language version: an explicit language policy, translated titles and tables of contents, summaries and standard descriptions, etc.
    2. Take into account and manage user feedback/suggestions – to be considered: what are the best methods, tools, practices to remedy the problems mentioned above?


Author: Richard Ishida. Contributors: scribes for the workshop sessions, Jirka Kosek, Arle Lommel, Felix Sasaki, Reinhard Schäler, and Charles McCathieNevile. Photos in the collage at the top and various other photos courtesy of Richard Ishida. Video recording by the European Commission, and hosting of the video content by VideoLectures.