Thousands of duplicate image files — some scanned as many as four or five times across different departmental workflows — have accumulated inside the Arxiu Municipal de Barcelona's central digital repository, creating a backlog that archivists and city IT staff have spent much of this week working to untangle. The problem, long flagged internally, came to a head after a June audit revealed the duplication had inflated storage costs and was hampering public search tools on the archive's online portal.
The timing matters. Barcelona City Hall has staked considerable political capital on its 2024–2027 Digital Transition Plan, a framework approved under Mayor Jaume Collboni that committed the municipal administration to making at least 80 percent of its historical photographic collections searchable online by the end of next year. Duplicate images don't just waste server space — each redundant file requires its own metadata entry, meaning cataloguers at the Arxiu's Sant Pau premises on Carrer de la Palla have been unknowingly describing the same photograph multiple times under slightly different file names.
How the Backlog Built Up
The root cause is structural. Over the past six years, at least three separate city departments have each run parallel scanning campaigns using different equipment and naming conventions: urban planning, the Institut de Cultura de Barcelona (ICUB), and the district offices of Gràcia and Eixample. When files were eventually migrated to the shared repository hosted by the Consorci de Serveis Universitaris de Catalunya (CSUC), deduplication protocols were not applied consistently. As of this week, IT teams estimate that roughly 12 percent of all image files in the municipal archive are exact or near-exact duplicates, though that figure is still being verified.
The issue is not unique to Barcelona. The Arxiu Nacional de Catalunya in Sant Cugat del Vallès has dealt with similar challenges since launching its own mass digitisation effort in 2021, and digital heritage specialists across Europe have increasingly called for shared deduplication standards at the point of ingestion rather than after the fact. Barcelona, however, faces particular pressure because the municipal archive's online portal — used by researchers, journalists, and the general public to access historical images of the city — has become noticeably slower since early June, with some advanced searches timing out before returning results.
This Week's Response and What Comes Next
On Monday, July 1, city technicians began deploying a perceptual hashing tool designed to flag visually identical or near-identical images even when file names differ. The process is expected to take at least three weeks to run across the full repository, which holds upwards of 2.3 million digitised items. Once flagged, duplicate files will not be automatically deleted. Instead, archivists will review each cluster manually, a labour-intensive step meant to avoid accidentally erasing legitimate variant scans that carry different conservation value.
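The city has not said which perceptual hashing tool it is using, so the following is only a minimal sketch of the general idea, using a difference hash (dHash), one common perceptual hashing technique. It assumes each image has already been decoded and downscaled to a small grayscale grid; the file names, the tiny 3x4 grids, and the `max_distance` threshold are all hypothetical. Matching images are only grouped for human review, never deleted.

```python
def dhash(pixels):
    """Difference hash: one bit per horizontally adjacent pixel pair.

    `pixels` is a list of rows of grayscale values, assumed to be already
    downscaled to a small fixed grid (8 rows x 9 columns would give 64 bits).
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def cluster_for_review(hashes, max_distance=4):
    """Group names whose hash is within `max_distance` bits of a cluster's first member.

    Only clusters with more than one member are returned. Nothing is deleted:
    the output is a worklist for manual review.
    """
    clusters = []
    for name, h in hashes.items():
        for cluster in clusters:
            if hamming(h, hashes[cluster[0]]) <= max_distance:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return [c for c in clusters if len(c) > 1]


# Hypothetical 3x4 grayscale grids (9-bit hashes) standing in for real scans.
original = [[10, 20, 30, 40], [40, 30, 20, 10], [10, 50, 20, 60]]
rescan = [[12, 22, 31, 43], [44, 33, 22, 12], [55, 50, 25, 61]]   # same photo, new scan
different = [[90, 10, 80, 5], [5, 80, 10, 90], [60, 20, 50, 10]]  # unrelated photo

hashes = {
    "icub_0412.tif": dhash(original),
    "gracia_scan_77.tif": dhash(rescan),
    "eixample_0031.tif": dhash(different),
}
# A tighter threshold is used here because the demo hashes are only 9 bits long.
review_clusters = cluster_for_review(hashes, max_distance=2)
```

In this toy data the two scans of the same photograph differ by a single bit and land in one review cluster, while the unrelated image stays on its own. The threshold is a tuning decision: set it too loose and legitimate variant scans get lumped together, which is one reason a manual review step is useful.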
ICUB has also confirmed it will revise its scanning protocol so that future digitisation projects in city facilities — including the Biblioteca de Catalunya on Carrer de l'Hospital and the Palau de la Virreina on La Rambla — feed into a single intake pipeline with deduplication applied before files enter the shared system. That change is expected to take effect by September 2026, ahead of a planned autumn campaign to digitise photographic collections from the 1992 Olympic Games era.
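ICUB has not published the details of the revised protocol. As a rough illustration of what deduplication at the point of intake can look like, the sketch below checks each incoming file's SHA-256 digest against an index before accepting it. The `IntakePipeline` class, the reference numbers, and the department labels are hypothetical, and this version only catches byte-identical files, not near-duplicates from separate scans.

```python
import hashlib


class IntakePipeline:
    """Hypothetical single intake point that rejects byte-identical files."""

    def __init__(self):
        self._seen = {}      # SHA-256 digest -> reference of the first copy
        self.accepted = []   # (reference, department) pairs actually stored

    def submit(self, reference, department, content):
        """Store new content; for a repeat, return the existing reference."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._seen:
            return self._seen[digest]
        self._seen[digest] = reference
        self.accepted.append((reference, department))
        return reference


pipeline = IntakePipeline()
first = pipeline.submit("REF-001", "ICUB", b"olympic-1992-frame-12")
second = pipeline.submit("REF-002", "Gracia district", b"olympic-1992-frame-12")
third = pipeline.submit("REF-003", "Urban planning", b"olympic-1992-frame-13")
```

Here the second submission is recognised as a copy of the first and mapped back to the existing reference, so only two files enter the shared system.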
For anyone who relies on the Arxiu Municipal's public portal, the practical advice for now is to search by direct catalogue reference number, which remains unique even when duplicate image files exist, rather than by keyword image search, which is currently returning inconsistent results. The city's digital services team said on its status page this week that full portal performance should be restored by late July, once the first phase of deduplication is complete. The structural fix, getting every municipal department to speak the same digital language before a file ever reaches the archive, will take considerably longer.