The Arxiu Municipal de Barcelona confirmed this week that it has launched a dedicated cleanup operation targeting thousands of duplicate and mislabelled image files that have accumulated across its digital collections since its first mass digitisation push began in 2009. The problem, long acknowledged internally, has now reached a scale that is actively hampering public access to historical records.
The issue matters now because the archive is in the middle of a broader digitalisation drive tied to the city's 2025–2028 Digital Barcelona Plan, which commits the Ajuntament de Barcelona to open, searchable public data. Duplicate images inflate storage costs, produce false results in public search tools, and in some cases have caused the same photograph to be attributed to different dates or locations — a particular problem for records relating to the Eixample district and the Gothic Quarter, two areas with the heaviest photographic documentation going back to the late nineteenth century.
The operational hub for the project sits at the Arxiu Municipal Contemporani, on Carrer de Sant Pau in the Raval neighbourhood, which holds the largest single collection of twentieth-century municipal photographs. Staff there are working alongside technicians from the Institut Municipal d'Informàtica, the city's own IT agency, to run automated deduplication software across an estimated 340,000 digitised items. A parallel manual review is focused on approximately 12,000 files flagged as high-priority — images tied to legal records, urban planning permits, and cultural heritage listings where a cataloguing error carries real administrative consequences.
How Duplicates Built Up
The root cause is straightforward: successive digitisation campaigns over nearly two decades used different file-naming conventions, metadata standards, and scanning resolutions. When collections were merged into a single content management system — a process that accelerated between 2018 and 2022 — matching logic failed to catch near-identical images that had been scanned twice at different resolutions or cropped slightly differently. The result was a catalogue where a single 1960s photograph of the Mercat de Sant Antoni renovation could appear under three separate reference numbers with three different credited dates.
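To illustrate why exact matching misses a re-scan, here is a minimal Python sketch. It is not the archive's software: the synthetic images and the helpers `make_scan`, `checksum`, `average_hash` and `hamming` are assumptions made for illustration. It compares a byte-level checksum with a simple perceptual "average hash" on the same picture captured at two resolutions.

```python
import hashlib

def make_scan(n, brightness=0):
    """Synthetic n x n grayscale 'scan' of one picture (hypothetical test data)."""
    return [[(200 if (2 * x < n) != (2 * y < n) else 50) + brightness
             for x in range(n)] for y in range(n)]

def checksum(pixels):
    """Exact fingerprint: differs as soon as a single byte differs."""
    data = bytes(v for row in pixels for v in row)
    return hashlib.sha256(data).hexdigest()

def average_hash(pixels, size=8):
    """Perceptual fingerprint: shrink to size x size, then mark cells above the mean."""
    h, w = len(pixels), len(pixels[0])
    cells = []
    for by in range(size):
        y0, y1 = by * h // size, (by + 1) * h // size
        for bx in range(size):
            x0, x1 = bx * w // size, (bx + 1) * w // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

scan_a = make_scan(64)                 # first campaign
scan_b = make_scan(32, brightness=10)  # later re-scan: lower resolution, slightly brighter

exact_match = checksum(scan_a) == checksum(scan_b)               # False
distance = hamming(average_hash(scan_a), average_hash(scan_b))   # 0
```

The checksums differ because the files differ byte for byte, while the perceptual hashes agree because both scans have the same coarse light and dark layout. Real photographs would rarely give a distance of exactly zero, which is why such tools compare distances against a threshold.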
The Sant Antoni market case is not hypothetical. It has been cited internally as a textbook example of how the problem compounds. On at least two occasions in the past three years, researchers requesting images for academic publication, including some from institutions such as the Universitat de Barcelona's history faculty on Gran Via de les Corts Catalanes, have received duplicate files without realising it, leading to corrections in published work.
Storage is a measurable cost. Municipal IT procurement documents from 2024 put the annual bill for archive cloud storage at just under €280,000. Officials have indicated, without giving a precise figure, that eliminating confirmed duplicates could meaningfully reduce that bill before the next budget cycle, which begins in January 2027.
What Comes Next for Researchers and the Public
The deduplication project is expected to run through the end of October 2026. Once the automated phase is complete, the Arxiu Municipal plans to update its public search portal — accessible at arxiu.barcelona.cat — with corrected metadata and consolidated file entries. Users who have saved direct links to specific archive images should expect some URLs to change when duplicate records are merged and a single canonical entry is kept.
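For readers wondering what merging into a single canonical entry means in practice, the following Python sketch shows one way it could work. Everything here is hypothetical: the reference numbers are invented, the rule for picking the canonical entry (the smallest reference) is an arbitrary assumption, and the archive has not said whether old references will be redirected.

```python
def merge_duplicates(groups):
    """Map every reference in each confirmed duplicate group to one canonical entry.

    Assumption: the canonical entry is simply the smallest reference in the group.
    """
    redirects = {}
    for group in groups:
        canonical = min(group)
        for ref in group:
            redirects[ref] = canonical
    return redirects

def resolve(ref, redirects):
    """Return the surviving reference for ref, or ref itself if it was never merged."""
    return redirects.get(ref, ref)

# Hypothetical example: one photograph catalogued under three reference numbers.
redirects = merge_duplicates([["AMC-0412", "AMC-1987", "AMC-3301"]])
surviving = resolve("AMC-3301", redirects)   # "AMC-0412"
untouched = resolve("AMC-9999", redirects)   # "AMC-9999"
```

A link built on one of the merged references would stop pointing at its own record once the merge happens, which is why keeping a note of current reference numbers is useful.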
For anyone with ongoing research projects dependent on the archive, the practical advice is to download and locally save any images already in use, noting the current reference numbers. The archive's reading room on Carrer de Sant Pau will remain open during the process, and staff are handling queries about specific collections on a case-by-case basis.
The wider lesson is one that other European city archives have been wrestling with for years. Madrid's Archivo de Villa completed a similar deduplication exercise in 2023. Barcelona's effort is more complex by volume, but the methodology being applied — a hybrid of perceptual hashing algorithms and human review — is now considered standard practice. The goal is a cleaner, faster public catalogue by the time Barcelona hosts its next major wave of visitors. That deadline is self-imposed, and the archive team is not getting more staff to meet it.
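The hybrid approach described above can be sketched as a simple triage rule in Python. This is an illustration under stated assumptions, not the archive's actual workflow: the function name `triage`, the thresholds and the sample distances are all hypothetical. Pairs of images whose perceptual hashes are almost identical are proposed for merging, borderline pairs go to a person, and the rest are treated as different images.

```python
def triage(distance, auto_max=2, review_max=12):
    """Sort an image pair by the Hamming distance between its perceptual hashes.

    The thresholds are illustrative assumptions; in practice they would be
    tuned on a sample of images checked by staff.
    """
    if distance < 0:
        raise ValueError("distance cannot be negative")
    if distance <= auto_max:
        return "auto-merge candidate"
    if distance <= review_max:
        return "human review"
    return "distinct"

# Hypothetical distances for three pairs of catalogue entries.
decisions = [triage(d) for d in (0, 7, 31)]
```

Keeping a middle band for human review reflects the trade-off in any such scheme: a threshold that is too loose merges different photographs, while one that is too strict leaves duplicates in the catalogue.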