Storing data

We have two disks:

  • /scratch (5.0T, not backed up)
  • /data_bck (500G, backed up by IT)

Each of these has a private and a public folder.

Shared data should go in the public folder; individual user projects should go in the private folder.

Public data should be documented in this wiki.

Set up access rights to comply with this.
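As a minimal sketch of what setting up access rights could look like, the Python snippet below applies one mode to a private folder and another to a public folder. The modes (0o700 and 0o755) and the function names are assumptions for illustration; this page does not prescribe exact permissions, so adjust them to what your project has agreed on.

```python
import os
import stat

# Assumed modes (not specified on this page): owner-only access for
# private folders, readable by everyone but writable only by the owner
# for public folders.
PRIVATE_MODE = 0o700
PUBLIC_MODE = 0o755


def set_folder_modes(private_dir, public_dir):
    """Apply the assumed modes to one private and one public folder."""
    os.chmod(private_dir, PRIVATE_MODE)
    os.chmod(public_dir, PUBLIC_MODE)


def mode_of(path):
    """Return the permission bits of path, e.g. 0o700."""
    return stat.S_IMODE(os.stat(path).st_mode)
```

You would call `set_folder_modes` with your own project folders under the private and public directories, then use `mode_of` to confirm the result. Note that `os.chmod` only changes the folder itself, not the files inside it.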

Do not store large files in your home folder. Although we log in with our university accounts, $HOME on this system is not the same as your $HOME on the university system; it is part of the storage reserved for the OS.


From the README file

  • /scratch

Scratch file space is dedicated to temporary data only. Data can be erased at regular intervals without any prior notification. It cannot be used for permanent data storage. Data on /scratch will NOT be backed up.

  • /data_bck

For long-term storage and backup, please see the README on the /data_bck disk.

Sharing data

Some projects generate data that can be used (in agreement with data owners) by other researchers and perhaps students. That data needs to be documented so future users know what the data is, terms for use, who generated or collected it and how it was collected.

Data collection is expensive. It takes code, storage and man hours. If your project has data it can share, please do. It makes bootstrapping a new project easier, lets researchers and students quickly test ideas, and in the best case lets more projects and publications emerge from the same data. Also: please cite where data comes from - data collectors like to be recognised.

  • Twitter data

Location: /data_bck/public/data_twitter_valgkamp
Desc: SQL dump of tweets from Norwegian elections (hashtags like #valg, etc.). Collected with yourTwapperKeeper.
People: Hallvard Moe has published on this data, so has Eirik Stavelin. Ask them.

  • Newspaper articles

Location: /data_bck/public/data_ntb_nak
Desc: Norwegian newspaper articles, sorted by news category. Collected by Norsk Avis Korpus (NAK) and Norsk Telegrambyrå (NTB).
People: Dag Elgesem, Eirik Stavelin (converted this data from XML to plain text and auto-categorised the NAK articles), and NAK/NTB.


Community Database

Nelson is running a MongoDB server which can only be accessed from localhost. The database files are stored on scratch (/scratch/mongodb) and should be considered volatile. If you need a backup, make regular dumps yourself, as often as and to wherever is appropriate for your data. Although the data is not to be considered securely stored, do not alter other people's data without consent.
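A regular dump could be scripted along the following lines. This is only a sketch: it assumes the standard `mongodump` tool is installed on the machine, and the function names and the dated folder layout are made up for illustration. Where the dump should go is up to you; `backup_root` is a placeholder.

```python
import subprocess
from datetime import date


def build_dump_command(db_name, out_dir):
    """Build a mongodump call for one database on the local server."""
    return ["mongodump", "--host", "localhost", "--db", db_name, "--out", out_dir]


def dump_database(db_name, backup_root):
    """Dump db_name into a dated subfolder of backup_root (assumed layout)."""
    out_dir = f"{backup_root}/{db_name}-{date.today().isoformat()}"
    # check=True raises CalledProcessError if mongodump exits with an error.
    subprocess.run(build_dump_command(db_name, out_dir), check=True)
    return out_dir
```

Calling `dump_database` from a scheduled job, with your own database name and a folder you have write access to, would give you one dated dump per run.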

Current collections

graball
A database called "graball" contains a collection called "urls", which holds the URLs hit by the graball system in the diversity project.
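To look at this collection from Python, something like the following read-only sketch could work. It assumes the third-party `pymongo` package is installed and that the server listens on MongoDB's default port 27017; neither is confirmed on this page. The function names are illustrative, and the structure of the stored documents is not documented here, so the code just returns them as they are.

```python
def connect():
    """Connect to the local MongoDB server (default port 27017 is assumed)."""
    from pymongo import MongoClient  # third-party package, assumed installed
    return MongoClient("localhost", 27017)


def sample_urls(client, limit=5):
    """Return up to `limit` documents from the graball.urls collection."""
    cursor = client["graball"]["urls"].find().limit(limit)
    return list(cursor)
```

Running `sample_urls(connect())` on the server itself would return a handful of documents to inspect. The sketch only reads data, in keeping with the rule about not altering other people's data.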