Barcelona's Digital Archives Push Forward on Duplicate Image Replacement This Week
City-linked institutions are accelerating a long-overdue clean-up of redundant visual records, with implications for heritage access and AI-training datasets alike.

Barcelona's cultural and municipal archiving sector moved decisively this week to tackle a problem that has accumulated quietly for years: thousands of duplicate images clogging public digital collections, distorting search results, and bloating storage costs across city-linked repositories. The push, coordinated in part through the Consorci de Serveis Universitaris de Catalunya (CSUC), targets legacy digitisation projects that created overlapping image files when collections were migrated to new platforms between 2018 and 2023.
The timing is not accidental. With AI companies increasingly licensing municipal and university image archives for training datasets, the quality and integrity of those catalogues carry real commercial and legal weight. A collection riddled with near-identical duplicates can skew model outputs, trigger copyright complications, and reduce the licensing fees that publicly funded institutions badly need. For a city already managing a strained municipal budget (Mayor Jaume Collboni's administration has prioritised housing and tourism enforcement spending), finding revenue from digital assets matters.
On Tuesday, the Arxiu Municipal de Barcelona, headquartered on Carrer de Santa Caterina in the Sant Pere neighbourhood, confirmed it had completed the first phase of an automated deduplication sweep across its photographic holdings. The sweep covered roughly 340,000 image files, according to internal documentation circulated to partner institutions. A second phase, expected to run through September, will address the archive's audiovisual and cartographic collections.
Separately, the Biblioteca de Catalunya on Carrer de l'Hospital in the Raval reported that a pilot programme using open-source perceptual hashing tools — software that identifies visually similar images even when filenames differ — had flagged approximately 12 percent of one digitised newspaper collection as probable duplicates. That collection, drawn from early twentieth-century Catalan-language press, had been scanned across two different grant-funded projects, once in 2019 and again in 2021, producing overlapping records that had sat undetected in the public-facing catalogue.
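For readers curious how perceptual hashing works in principle, the sketch below implements average hashing, one of the simplest techniques in that family, in plain Python. It is an illustration only, not the tool the Biblioteca de Catalunya used; the function names, the toy "scans", and the 5-bit threshold are all assumptions made for the example.

```python
def average_hash(pixels, size=8):
    """Return a (size * size)-bit average hash of a grayscale image.

    `pixels` is a list of rows of 0-255 brightness values. Height and
    width must each be at least `size`.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    # Shrink the image to a size x size grid by averaging each block.
    for r in range(size):
        top, bottom = r * height // size, (r + 1) * height // size
        for c in range(size):
            left, right = c * width // size, (c + 1) * width // size
            block = [pixels[y][x]
                     for y in range(top, bottom)
                     for x in range(left, right)]
            cells.append(sum(block) / len(block))
    # One bit per cell: is it brighter than the overall mean?
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(a, b):
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_probable_duplicate(a, b, max_bits=5):
    """Hypothetical rule: flag a pair whose hashes differ in few bits."""
    return hamming_distance(a, b) <= max_bits


# Toy data standing in for real scans (all invented for this example):
# a 16x16 gradient, the same picture rescanned at 32x32 and slightly
# brighter, and a mirrored gradient representing a different image.
scan_2019 = [[(x + y) * 8 for x in range(16)] for y in range(16)]
scan_2021 = [[(x // 2 + y // 2) * 8 + 3 for x in range(32)] for y in range(32)]
other_image = [[(15 - x + y) * 8 for x in range(16)] for y in range(16)]

hash_2019 = average_hash(scan_2019)
hash_2021 = average_hash(scan_2021)
hash_other = average_hash(other_image)

print("rescan flagged:", is_probable_duplicate(hash_2019, hash_2021))
print("different image flagged:", is_probable_duplicate(hash_2019, hash_other))
```

Because the hash depends on relative brightness rather than on filenames or exact bytes, the rescaled, slightly brighter copy hashes close to the original while the mirrored image does not. Any real threshold would need tuning against the collection in question.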
Everyday researchers have felt the effects directly: search queries for historical Barcelona street photography, for example, returned the same source image several times under different metadata tags, making it harder to assess the genuine breadth of a collection. Librarians at the Biblioteca de Catalunya have been manually flagging problem records since at least early 2025, but the volume made human review alone unworkable.
Barcelona's situation is not unique, but the city's particular combination of active heritage digitisation, a strong university sector, and growing interest from European AI firms makes the stakes higher here than in many comparable cities. The Barcelona Supercomputing Center at the Nexus II building on the UPC campus in the Zona Universitària has been working with cultural institutions on exactly these data-quality questions as part of preparatory work for developing Catalan language models.
The cost of storage is a concrete pressure point. Commercial cloud storage for uncompressed archival image files runs at roughly €0.02 per gigabyte per month at current enterprise rates, and large municipal archives can hold several hundred terabytes. Eliminating confirmed duplicates — even a conservative 10 percent of holdings — produces measurable annual savings that can be redirected to new acquisitions or improved public access tools.
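To give a sense of scale, the short calculation below applies the quoted rate to a hypothetical archive. The 300-terabyte holding is an assumption chosen to stand in for "several hundred terabytes", not a figure reported by any institution, and the function name is illustrative. On those assumptions, removing 10 percent of holdings saves about €7,200 a year.

```python
PRICE_EUR_PER_GB_MONTH = 0.02  # enterprise rate quoted in the article
HOLDINGS_TB = 300              # assumption: "several hundred terabytes"
DUPLICATE_SHARE = 0.10         # the article's conservative 10 percent


def annual_saving_eur(holdings_tb, duplicate_share,
                      price_per_gb_month=PRICE_EUR_PER_GB_MONTH):
    """Yearly storage cost avoided by deleting the duplicate share."""
    holdings_gb = holdings_tb * 1000  # decimal terabytes to gigabytes
    return holdings_gb * duplicate_share * price_per_gb_month * 12


saving = annual_saving_eur(HOLDINGS_TB, DUPLICATE_SHARE)
print(f"Estimated annual saving: EUR {saving:,.0f}")
```

The saving scales linearly, so an archive twice the size, or a duplicate share twice as high, would double the figure.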
The deduplication work also intersects with the city's short-term rental crackdown in a less obvious way: several contested Airbnb listings have been identified partly through cross-referencing duplicate property images appearing in both tourism platforms and municipal planning records, a use case that archivists say validates the investment in cleaner catalogues.
Institutions involved in the current sweep say public-facing catalogue updates will begin appearing from mid-July onward. Researchers who use the Arxiu Municipal's online portal or the Biblioteca de Catalunya's digital collections tool should note that some previously indexed records may temporarily disappear or be consolidated under updated identifiers as the deduplication process completes. No original source materials are being deleted: only redundant digital copies are being removed or merged.
Published by The Daily Barcelona