

The Diversity project (or "Mangfold") is a data collection, cleaning and analysis project. The data collection is over (for now) and has yielded over 400 GB of HTML files collected from 189 different news outlets over three months.

Access to the web interface

Much of the data associated with this project is available online. Currently, all sites with meaningful content require login. If you need access, contact Truls Pedersen and give him your UIB username.

Raw Data

The raw data is available on the nelson server, but is unlikely to be of interest to most users. Please contact Truls Pedersen if you need this data.

Cleaned Data

The cleaning phase has started, and we will make data available as we progress. If you find substantial or systematic errors in any of the data, please expand the comment on the resource page of the erroneously cleaned document (i.e., append your report to the existing comment). This will help us produce better data in the future. How to find the resource page is described below.


This directory contains dumps of cleaned data from the Mangfold project. If you want to use this data (except for tentative explorations), please contact Helle Sjøvaag or Truls Pedersen.

The files come in pairs:

* <SITE TAG>_<DATE>.json, and
* <SITE TAG>_<DATE>.url.json

They are dumps (produced on <DATE>) representing the stipulated data extraction from the given site <SITE TAG>. The first file contains a doc_id-indexed json with analyses. The second file contains a doc_id-indexed json with location information.
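The pairing described above can be sketched in a few lines of Python. This is only an illustration: the function name `pair_dumps` is hypothetical, and the only assumption about the filenames is that both files of a pair share everything before the `.json` / `.url.json` ending.

```python
def pair_dumps(names):
    """Group dump filenames into (analysis file, url file) pairs by shared stem."""
    analyses, urls = {}, {}
    for name in names:
        # Test the longer ending first, since ".url.json" also ends in ".json".
        if name.endswith(".url.json"):
            urls[name[: -len(".url.json")]] = name
        elif name.endswith(".json"):
            analyses[name[: -len(".json")]] = name
    # Keep only stems for which both files are present.
    return {
        stem: (analyses[stem], urls[stem])
        for stem in sorted(analyses)
        if stem in urls
    }


example = ["AP_25.05.2016.json", "AP_25.05.2016.url.json", "nrk_25.05.2016.json"]
pairs = pair_dumps(example)
print(pairs)
```

To apply it to a real directory, the names could come from `os.listdir()` on the dump folder.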

Please note that these dumps may be replaced by new dumps at any time and without warning. When that happens the filenames will change, but the general pattern described above will be maintained. If you need consistency beyond this, you must copy the files you need elsewhere.
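One way to keep a stable copy is sketched below. The helper `snapshot` and the directory names are hypothetical; the demonstration runs on throwaway directories with empty stand-in files rather than on the real dumps.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot(dump_dir, dest_dir, stem):
    """Copy '<stem>.json' and '<stem>.url.json' from dump_dir into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for suffix in (".json", ".url.json"):
        src = Path(dump_dir) / (stem + suffix)
        if src.exists():
            shutil.copy2(src, dest / src.name)  # copy2 also preserves timestamps
            copied.append(src.name)
    return copied


# Demonstration on temporary directories with stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    dumps = Path(tmp) / "dumps"
    dumps.mkdir()
    for name in ("AP_25.05.2016.json", "AP_25.05.2016.url.json"):
        (dumps / name).write_text("{}")
    copied = snapshot(dumps, Path(tmp) / "my_copy", "AP_25.05.2016")
    print(copied)
```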

Each entry's doc_id field corresponds to the unique identifying code in the Mangfold project associated with the document from which the analysis was obtained.

If the analysis entry for some doc_id contains a key <attr> (other than "time", "date" or "tag"), the Mangfold scraper has found the attribute <attr> in the document associated with that doc_id, and the value stored under that key is the value of the attribute.

The keys "time", "date" and "tag" are exceptions. These are generated by the scraper/cleaner and can be ignored.
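As an illustration of how these keys might be read, the sketch below counts how often each attribute occurs while skipping the generated keys. It assumes the analysis file is a JSON object mapping each doc_id to a dictionary of attributes; that layout, the function name `attribute_counts` and the sample data are assumptions made for the example.

```python
import json
from collections import Counter

# Keys generated by the scraper/cleaner rather than found in the document.
GENERATED_KEYS = {"time", "date", "tag"}


def attribute_counts(analyses):
    """Count, per attribute, how many entries contain it."""
    counts = Counter()
    for entry in analyses.values():
        counts.update(key for key in entry if key not in GENERATED_KEYS)
    return counts


# Made-up sample standing in for the contents of an analysis file.
sample = json.loads(
    '{"d1": {"title": "A", "body": "text", "tag": "AP"},'
    ' "d2": {"title": "B", "date": "x"}}'
)
counts = attribute_counts(sample)
print(counts)
```

For a real dump, `sample` would be replaced by the result of `json.load()` on the opened analysis file.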

The json in the second file contains the following three location entries for every doc_id for the given site:

* url : Original (sometimes marginally cleaned) URL of the document, 
* loc : Path to location of the original scraped HTML file, and
* res : URL of this document's resource page
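Because both files are indexed by doc_id, an analysis entry can be combined with its location entry by a simple lookup. The sketch below assumes both files are JSON objects keyed by doc_id; the function name `locate` and all sample values (including the example.org addresses and the path) are placeholders, not real project data.

```python
def locate(analyses, locations, doc_id):
    """Return one document's analysis together with its url, loc and res."""
    location = locations.get(doc_id, {})
    return {
        "analysis": analyses[doc_id],
        "url": location.get("url"),  # original URL of the document
        "loc": location.get("loc"),  # path to the scraped HTML file
        "res": location.get("res"),  # URL of the resource page
    }


# Placeholder data standing in for the two files of a pair.
analyses = {"d1": {"title": "A"}}
locations = {
    "d1": {
        "url": "https://example.org/article",
        "loc": "/path/to/d1.html",
        "res": "https://example.org/resource/d1",
    }
}
record = locate(analyses, locations, "d1")
print(record)
```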

On the resource page, you can leave comments about errors you encounter with respect to the cleaning. Note, however, that the cleaning is based on the scraped HTML and *not* the live document. Discrepancies between the result of cleaning and the live version are to be expected. In order to access the resource page, your UIB username must be added to the list of approved users. Please contact Truls Pedersen.

The original files (found at the local path) are sensitive. Please do not modify. The entire raw corpus (organized by date) can also be found on


There is a program in this folder that shows how the data in this folder can be used.

Example use (python 3):

python -a AP_25.05.2016.json -u AP_25.05.2016.url.json


Domain distribution:
 DOMAIN                    | OCCURRENCES
-------------------------- | --------------------
                           |       4
                           |      47
                           |     451
                           |   12365
                           |    1364

Attributes distribution:
 ATTR            | OCCURRENCES
---------------- | --------------------
body             |    1728
meta (OG)        |   13913
meta (Twitter)   |   13453
meta (general)   |   13994
oppdateringsdato |    1728
page type        |     237
publiseringsdato |    1728
title            |   14231

There are 14231 documents for this tag.

If you find a substantial error, please leave a comment at the erroneously analysed document's resource page.

The above example shows some shortcomings. When the cleaning process applied to this particular site (AP) matures, we expect more documents to have a body attribute. If you encounter a document that has a body in the scraped version but not in the analysis, this most likely constitutes a systematic error and should be reported on that document's resource page.
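A rough way to find candidates for such reports is to list the documents whose analysis lacks a body, together with their resource pages. The sketch assumes both files of a pair are JSON objects keyed by doc_id; the function name `missing_body` and the sample values are made up for illustration. A missing body is only a hint: the scraped HTML still has to be checked by hand.

```python
def missing_body(analyses, locations):
    """Return (doc_id, resource page URL) for entries without a 'body' key."""
    report = []
    for doc_id, entry in sorted(analyses.items()):
        if "body" not in entry:
            res = locations.get(doc_id, {}).get("res")
            report.append((doc_id, res))
    return report


# Placeholder data standing in for the two files of a pair.
analyses = {
    "d1": {"title": "A", "body": "text"},
    "d2": {"title": "B"},
}
locations = {
    "d1": {"res": "https://example.org/resource/d1"},
    "d2": {"res": "https://example.org/resource/d2"},
}
candidates = missing_body(analyses, locations)
print(candidates)
```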

Current dumps

Currently available dumps
Analyses URLs
AP_25.05.2016.json AP_25.05.2016.url.json
AvisenAgder_25.05.2016.json AvisenAgder_25.05.2016.url.json
aasavis_25.05.2016.json aasavis_25.05.2016.url.json
adressa_25.05.2016.json adressa_25.05.2016.url.json
aftenposteninnsikt_25.05.2016.json aftenposteninnsikt_25.05.2016.url.json
agderposten_25.05.2016.json agderposten_25.05.2016.url.json
akershusamtstidende_25.05.2016.json akershusamtstidende_25.05.2016.url.json
altaposten_25.05.2016.json altaposten_25.05.2016.url.json
andalsnesavis_25.05.2016.json andalsnesavis_25.05.2016.url.json
andoyposten_25.05.2016.json andoyposten_25.05.2016.url.json
arbeidetsrett_25.05.2016.json arbeidetsrett_25.05.2016.url.json
arendalstidende_25.05.2016.json arendalstidende_25.05.2016.url.json
asanetidende_25.05.2016.json asanetidende_25.05.2016.url.json
askoyveringen_25.05.2016.json askoyveringen_25.05.2016.url.json
auraavis_25.05.2016.json auraavis_25.05.2016.url.json
austagderblad_25.05.2016.json austagderblad_25.05.2016.url.json
avisahemnes_25.05.2016.json avisahemnes_25.05.2016.url.json
avisanordland_25.05.2016.json avisanordland_25.05.2016.url.json
ba_25.05.2016.json ba_25.05.2016.url.json
bladetvesteralen_25.05.2016.json bladetvesteralen_25.05.2016.url.json
bomlonytt_25.05.2016.json bomlonytt_25.05.2016.url.json
bronnoysundavis_25.05.2016.json bronnoysundavis_25.05.2016.url.json
bt_25.05.2016.json bt_25.05.2016.url.json
budstikka_25.05.2016.json budstikka_25.05.2016.url.json
bygdanytt_25.05.2016.json bygdanytt_25.05.2016.url.json
bygdebladrandaberg_25.05.2016.json bygdebladrandaberg_25.05.2016.url.json
bygdeposten_25.05.2016.json bygdeposten_25.05.2016.url.json
dagbladet_25.05.2016.json dagbladet_25.05.2016.url.json
dagen_25.05.2016.json dagen_25.05.2016.url.json
dagensnaeringsliv_25.05.2016.json dagensnaeringsliv_25.05.2016.url.json
nrk_25.05.2016.json nrk_25.05.2016.url.json


This phase has not yet started.