Search by language now available on eluxemburgensia.lu

For better access to multilingual content

The eluxemburgensia.lu portal brings together all resources digitised by the National Library and has been equipped with a new search function. From now on, users can filter their search by language, for example German, French or Luxembourgish.

The newspapers, books and historical magazines digitised by the BnL bear witness to the culture of multilingualism present in Luxembourg and often bring together different languages on the same page. The new filtering option allows users who prefer to read articles in a specific language to filter all content on eluxemburgensia.lu and to further target their search.

The two main languages on the portal are German, which accounts for 67% of the total, and French, at 31.3%. Luxembourgish is represented in a much smaller proportion, with 118,123 articles (1.5%), while English, with 10,149 articles (0.1%), is only marginally present. Fourteen other languages have been identified, including Latin, Italian, Portuguese and Polish.

What is the language detection process?

While language determination for monographs is based on bibliographic records, the process is more complicated for serial articles for which no such data exists. As optical character recognition (OCR) is not yet perfect and as some types of texts, such as lists of names, do not have an identifiable language, the algorithm developed by the BnL uses several complementary heuristics:

  • A vote between the standard algorithms fasttext, cld3 and langid
  • Dictionaries of the languages identified in the collection
  • Measures of OCR quality
  • Information about other articles in the periodical
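
The first heuristic in this list, a vote between several detectors, can be illustrated with a minimal sketch. It is not the BnL's actual implementation: the function name vote_language, the min_agreement parameter and the stand-in detectors are assumptions made for illustration. In practice each detector would wrap one of the libraries named above.

```python
from collections import Counter
from typing import Callable, Iterable, Optional

# A detector takes a text and returns a language code, or None when unsure.
Detector = Callable[[str], Optional[str]]


def vote_language(
    text: str,
    detectors: Iterable[Detector],
    min_agreement: int = 2,  # assumed threshold, not taken from the article
) -> Optional[str]:
    """Return the language most detectors agree on, or None without agreement."""
    guesses = [detect(text) for detect in detectors]
    counts = Counter(guess for guess in guesses if guess is not None)
    if not counts:
        return None
    language, votes = counts.most_common(1)[0]
    return language if votes >= min_agreement else None


# Stand-in detectors (hypothetical): real ones would call fasttext, cld3, langid.
always_de = lambda text: "de"
always_fr = lambda text: "fr"
unsure = lambda text: None

majority = vote_language("Guten Morgen", [always_de, always_de, always_fr])
no_majority = vote_language("Müller, Schmit, Weber", [always_de, always_fr, unsure])
```

When no language reaches the agreement threshold, as with a list of names, the sketch returns None so that the other heuristics (dictionaries, OCR quality, neighbouring articles) can decide.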

For languages with fewer than 1,000 articles, the texts are reviewed by hand to check that the algorithm has determined the correct language. This helps maintain accuracy for less widely used languages. For the remaining articles, the language is not manually reviewed, so inaccuracies remain. In addition, for multilingual articles (e.g. one part in French, another in German), the dominant language is chosen. Finally, some content, such as lists of sports results, does not lend itself to this kind of language determination process.
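
The two rules described above can be sketched as follows. The article does not say how the dominant language is measured, so the character count used here is an assumption, and the function names dominant_language and languages_needing_review are hypothetical. Only the 1,000-article threshold comes from the text.

```python
from collections import Counter, defaultdict

REVIEW_THRESHOLD = 1000  # from the article: fewer than 1,000 articles => manual review


def dominant_language(segments):
    """Pick the language covering the most characters (assumed criterion).

    segments: list of (language_code, text) pairs for one article.
    """
    chars = defaultdict(int)
    for language, text in segments:
        chars[language] += len(text)
    if not chars:
        return None
    return max(chars, key=chars.get)


def languages_needing_review(article_languages, threshold=REVIEW_THRESHOLD):
    """Return languages whose article count is below the review threshold."""
    counts = Counter(article_languages)
    return sorted(lang for lang, n in counts.items() if n < threshold)


article = [("fr", "Bonjour"), ("de", "Guten Morgen zusammen")]
chosen = dominant_language(article)

collection = ["de"] * 1000 + ["la"] * 3
to_review = languages_needing_review(collection)
```

Here the German segment is longer, so the article as a whole would be labelled German, and only Latin falls below the threshold for manual review.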

Key points for developers

ICT professionals and developers will appreciate that the language information has been integrated into the metadata of the digitised documents and that it can be used for analysis. In fact, algorithms for automatic language processing provide a good basis for the development of new tools for the analysis of historical texts.
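
As an example of such an analysis, the sketch below computes the share of each language from a list of metadata records. The record layout, with a "language" key, is a hypothetical stand-in and not the portal's actual metadata schema.

```python
from collections import Counter


def language_shares(records):
    """Return the percentage of records per language, rounded to one decimal."""
    counts = Counter(record.get("language", "unknown") for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: round(100 * n / total, 1) for lang, n in counts.items()}


# Hypothetical records; real metadata would be read from the digitised documents.
records = [{"language": "de"}, {"language": "de"}, {"language": "fr"}, {}]
shares = language_shares(records)
```

Records without language information are counted under "unknown" rather than dropped, so that the shares still add up to the whole collection.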
