About this Guide
This guide is intended for people who want to implement a connection with the Repository Junction Broker service.
If you have further questions about the service that are not answered by this guide, please use the UK RepositoryNet+ Helpdesk contact form or email support@repositorynet.ac.uk directly.
About the RJ Broker
For an overview of the broker please refer to the RJB: User Manual.
RJ Broker Architecture
The broker is itself a repository that uses SWORD v1.3 to receive and transmit records.
Deposit Records
Suppliers deposit records into the broker using an agreed package format (see here), which the broker then unpacks.
Parse Record Metadata
The broker parses the supplied metadata to identify organisations, and therefore repositories, that this record should probably be sent to.
The broker is reliant on the Organisation and Repository Identification Service to make this practical. As part of the repository identification process, the broker also marks those repositories which have subscribed to the service.
Transfer Record
Periodically (initially daily), the broker finds all records that have subscribing repositories but have not yet been transferred to those repositories. The broker transfers each record, and notes when the successful transfer took place and the URI the target repository has given for the deposit.
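The selection step can be pictured with a minimal sketch. The `Record` and `Target` structures and their field names are assumptions made for illustration; they are not the broker's actual data model.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Target:
    """One candidate repository for a record (hypothetical structure)."""
    repository: str
    subscribed: bool
    transferred_on: Optional[date] = None  # set after a successful transfer
    deposit_uri: Optional[str] = None      # URI returned by the target repository


@dataclass
class Record:
    record_id: str
    targets: List[Target] = field(default_factory=list)


def pending_transfers(records):
    """Return (record_id, repository) pairs that still need transferring."""
    return [
        (rec.record_id, target.repository)
        for rec in records
        for target in rec.targets
        if target.subscribed and target.transferred_on is None
    ]


records = [
    Record("r1", [Target("repo-a", True), Target("repo-b", False)]),
    Record("r2", [Target("repo-a", True, date(2013, 5, 29), "http://example.org/1")]),
]
print(pending_transfers(records))  # [('r1', 'repo-a')]
```

Only subscribed repositories with no recorded transfer date are selected, so re-running the job does not repeat a successful transfer.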
Check Record Alive
Most repositories have a review and/or curation process before making records live. The broker has a separate process, run daily, that looks at all recently transferred records to see if they are visible via the given URI, and notes when they go live.
Supplier Engagement
For suppliers, engaging with the broker is straightforward:
- The supplier and the broker need to agree a packaging format for the supplier to deposit records into the broker.
-
RepNet+ works with the supplier to develop a bespoke Importer, a process that will involve a development/test cycle to ensure everything works correctly.
- Taking the lead from the PEER project, the broker uses the NLM-DTD format or the supplier's own, e.g. Europe PMC.
-
The broker will return metadata that will enable the supplier to track onward transmission of the record
- The supplier may elect to do something with the return data
- There is precedent for an email to be sent to a nominated email address, listing the last 24 hours of transfers.
- Agreement may be needed with the supplier for the support of specific features. For example, the RJ Broker can support embargoes.
Repository Engagement
For an Individual Repository (IR), engaging with the broker is very easy:
- The IR needs to sign up, which simply requires allowing the broker to deposit into the repository via SWORD v1.3.
-
Some records require an additional contractual agreement to honour any embargo periods.
- As an example, Nature Publishing Group send full-text records to the broker at the time of publication, however those records are covered by a six-month embargo.
- The full text files are only released to IRs that have agreed to honour that embargo period.
- RepNet have working importers for DSpace v3 and EPrints v3.2.n repositories, which can be easily modified for any personalisation an IR may have made as part of their agreement.
- The IR SWORD deposit routine must return a full URI for each deposited item: early EPrints installs only return an internal ID.
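A repository can sanity-check what its deposit routine returns before signing up. The following is a minimal sketch; the function name and the restriction to http/https are assumptions, not a broker requirement stated in this guide.

```python
from urllib.parse import urlparse


def is_full_uri(value):
    """True if `value` looks like an absolute http(s) URI rather than a bare ID."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(is_full_uri("http://devel.edina.ac.uk:1203/191/"))  # True
print(is_full_uri("191"))                                 # False
```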
Supplier Requirements
Since each supplier has their own set of metadata fields, the broker uses bespoke importers for each supplier: this allows it to receive data in the format that is best suited to the supplier/broker relationship and allows it to tag imports with the provenance of the depositing user. As the importers are unique to each supplier, the method for identifying the target Individual Repositories can also be tailored.
Organisation/Repository Identification
There are several options for identifying target repositories:
- The broker can scan the metadata for postal addresses and/or email addresses, and use the Organisation and Repository Identification (ORI) service to create a list of identifiable organisations, and therefore potential repositories.
- The supplier can define a list of repositories, which the broker then just uses.
- The supplier can provide a list of MUST repositories, and allow the broker to augment the list with POTENTIAL repositories.
Whilst it is not possible to prescribe a set of fields that must exist, we can show fields that definitely work. From the NLM-DTD, each contributor (author) has an associated affiliation record:
<contrib contrib-type='author' corresp='yes'>
  <name>
    <surname>Picus viridis</surname>
    <given-names>Yaffle</given-names>
  </name>
  <xref ref-type="aff" rid="r1">1</xref>
  <xref ref-type="corresp" rid="cor1">*</xref>
</contrib>
. . .
<aff id="r1">
  <addr-line>The University of Edinburgh, 160 Causewayside, Edinburgh. EH9 1PR</addr-line>
  <country>United Kingdom</country>
  <institution>UK RepositoryNet+</institution>
</aff>
This can be parsed, and we now find “University of Edinburgh” and “UK RepositoryNet+” as organisations.
Likewise, Europe PubMed Central just has a single institution identified. The broker can parse:
<affiliation>Arthritis Research UK Epidemiology Unit, Manchester Academic Health Science Centre, University of Manchester, Manchester, UK.</affiliation>
This would identify Arthritis Research UK and University of Manchester as possible organisations, which can be looked up in ORI in order to retrieve associated repositories.
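As an illustration of the first parsing step for the NLM-DTD case, the sketch below resolves each author's affiliation cross-reference to the text of the matching aff element. The `<article>` wrapper and the function name are assumptions added so the sample is a single well-formed document; matching the resulting text against ORI is not shown.

```python
import xml.etree.ElementTree as ET

# Cut-down NLM-style fragment; <article> is only a wrapper for this sketch.
SAMPLE = """
<article>
  <contrib contrib-type="author" corresp="yes">
    <name><surname>Picus viridis</surname><given-names>Yaffle</given-names></name>
    <xref ref-type="aff" rid="r1">1</xref>
    <xref ref-type="corresp" rid="cor1">*</xref>
  </contrib>
  <aff id="r1">
    <addr-line>The University of Edinburgh, 160 Causewayside, Edinburgh. EH9 1PR</addr-line>
    <country>United Kingdom</country>
    <institution>UK RepositoryNet+</institution>
  </aff>
</article>
"""


def author_affiliations(xml_text):
    """Map each author's surname to the text fields of their <aff> records."""
    root = ET.fromstring(xml_text.strip())
    affs = {aff.get("id"): aff for aff in root.iter("aff")}
    result = {}
    for contrib in root.iter("contrib"):
        if contrib.get("contrib-type") != "author":
            continue
        surname = contrib.findtext("name/surname")
        fields = []
        for xref in contrib.findall("xref"):
            if xref.get("ref-type") != "aff":
                continue  # skip correspondence and other cross-references
            aff = affs.get(xref.get("rid"))
            if aff is not None:
                fields.extend(child.text.strip() for child in aff if child.text)
        result[surname] = fields
    return result


print(author_affiliations(SAMPLE))
```

The address line and institution strings returned here are the raw material that would then be looked up in ORI.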
Bibliographic Metadata
In terms of bibliographic metadata, the list of “required” fields is small:
- Title
- Authors
- Journal/Publication
Pretty much everything else can be defaults (deposit is a letter/manuscript; publication date is today; item is not refereed; there are no documents, DOIs, or URI references to documents; no abstract; etc) and the publisher details can be deduced from the journal using the Sherpa/RoMEO service.
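A minimal sketch of this "required fields plus defaults" rule follows. The dictionary keys and default values are illustrative assumptions; the real importers are bespoke to each supplier and may name and default things differently.

```python
from datetime import date

# Hypothetical key names for the three required fields.
REQUIRED = ("title", "authors", "publication")


def apply_defaults(record, today=None):
    """Return a copy of `record` with defaults filled in.

    Raises ValueError if any required field is missing or empty.
    """
    missing = [name for name in REQUIRED if not record.get(name)]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    defaults = {
        "type": "letter/manuscript",
        "date": (today or date.today()).isoformat(),
        "refereed": False,
        "documents": [],
        "abstract": None,
    }
    # Values supplied in the record win over the defaults.
    return {**defaults, **record}


record = apply_defaults(
    {"title": "Teddy Bear Programming: Fact or Fiction?",
     "authors": ["Stuart, Ian"],
     "publication": "Acme News"},
    today=date(2013, 5, 29),
)
print(record["date"], record["refereed"])  # 2013-05-29 False
```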
Records without identifiable organisations/repositories
Whilst such records are not usable by the broker directly, their inclusion makes them available to third-party services based on the data in the broker, for example to:
- List all records attributed to a particular Funding Body
- List all records attributed to a particular Grant Code
- List all records with a particular Author
Repository Requirements
The broker deposits a standard package to all repositories, with the intent that this format can be easily adopted by others, making it a de facto standard interchange format.
Terminology
| Record | - a Deposited Object |
| Object | - the whole thing, a complete record. |
| Metadata | - the descriptive information about the object. |
| Document | - something end users want to read. May be a combination of multiple files (eg: a web page). |
| Binary Object | - a thing end users want to read/view, be it a document/jpeg/spreadsheet. Also called a file. |
| File | - a file. |
Basic Overview
The basic unit is a .zip file. This file will contain at least one file, mets.xml, containing the metadata, and may contain any number of additional files, with each document in its own directory.
This format was chosen to allow the deposit of an Object which itself describes a broker deposit record, and which therefore contains a Binary Object that is also called mets.xml; a flat layout would not allow this.
Depending on embargoes and subscriptions, the broker may attach the original deposited file, which allows Individual Repositories to mine that Object for additional data that is not given in the metadata.
Where documents exist, the last document will always be the original deposit item.
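To make the layout concrete, here is a small sketch that builds an in-memory package of this shape and groups its files by document directory. The file names are borrowed from the samples later in this guide; the helper names are assumptions.

```python
import io
import zipfile
from collections import defaultdict


def build_sample_package():
    """Create a small in-memory .zip shaped like the layout described above."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mets.xml", "<mets/>")
        zf.writestr("123/Spectator_safety.gif", b"GIF89a")
        zf.writestr("789/pdf1.pdf", b"%PDF-")
        zf.writestr("789/mets.xml", "<mets/>")  # nested name does not clash
    buf.seek(0)
    return buf


def list_documents(package):
    """Group file names by document directory; require a top-level mets.xml."""
    with zipfile.ZipFile(package) as zf:
        names = [n for n in zf.namelist() if not n.endswith("/")]
    if "mets.xml" not in names:
        raise ValueError("package has no top-level mets.xml")
    documents = defaultdict(list)
    for name in names:
        directory, _, filename = name.rpartition("/")
        if directory:  # files at the top level are not documents
            documents[directory].append(filename)
    return dict(documents)


print(list_documents(build_sample_package()))
# {'123': ['Spectator_safety.gif'], '789': ['pdf1.pdf', 'mets.xml']}
```

Because each document lives in its own directory, the mets.xml inside document 789 coexists with the package's own top-level mets.xml.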
The Metadata Description
The basic metadata file, mets.xml, is a METS file, with the record metadata encoded in Eprints-DC-XML (epdcx). See SWAP and epdcx for further details.
A basic METS package has 4 significant sections, in the following order:
| dmdSec | The Descriptive Metadata Section. |
| amdSec | Where the administrative (i.e. embargo) information is defined, using the same DCMI Abstract Model that epdcx uses. |
| fileSec | Lists all the files containing content which comprise the electronic versions of the digital object. |
| structMap | Where the structure of the files is described: which files are grouped together, and the embargo details on those files. |
dmdSec
This is the main metadata section, and uses the epdcx model developed by JISC. This is heavily based on the SWAP model:
| ScholarlyWork | type; title; abstract; identifier (the publisher's ID); creator; affiliated institution (possibly from authors); funder; GrantCode; isExpressedAs |
| Expression | type; identifier (doi and/or url-at-broker); date (published:yyyy-mm-dd); status (peer reviewed, etc); copyright_holder; citation; references; isManifestedAs |
| Manifestation | publication; publisher; issn; isbn; volume; issue; pagerange (first-last); AccessRights (open/restricted/closed); License; availableAs (doi/official_url/related_urls) |
| Agent | type; name or family-name & given-name; mailbox; (additional oarj namespace) org_name; ori_id |
Section Details
For the ScholarlyWork, the broker currently exports:
- type: ScholarlyWork
-
<epdcx:statement epdcx:propertyURI="http://purl.org/dc/elements/1.1/type" epdcx:valueURI="http://purl.org/eprint/entityType/ScholarlyWork"/>
- identifier: the identifier number as given by the supplier
-
<epdcx:statement epdcx:propertyURI="http://purl.org/dc/elements/1.1/identifier"> <epdcx:valueString>2011-12-06508</epdcx:valueString> </epdcx:statement>
- title: title of the deposit record
-
<epdcx:statement epdcx:propertyURI="http://purl.org/dc/elements/1.1/title"> <epdcx:valueString>Teddy Bear Programming: Fact or Fiction?</epdcx:valueString> </epdcx:statement>
- creators: The names of the authors
-
This record is slightly complex, as there is a reference to a fuller (Agent) record:
<epdcx:statement epdcx:propertyURI="http://purl.org/dc/elements/1.1/creator" epdcx:valueRef="IanStuart"> <epdcx:valueString>Stuart, Ian</epdcx:valueString> </epdcx:statement>
- abstract: the abstract
-
<epdcx:statement epdcx:propertyURI="http://purl.org/dc/terms/abstract"> <epdcx:valueString>Some text here....</epdcx:valueString> </epdcx:statement>
- affiliated institution: any affiliated institutions (as identified via creator's affiliation).
-
<epdcx:statement epdcx:propertyURI="http://purl.org/eprint/terms/affiliatedInstitution"> <epdcx:valueString>EDINA</epdcx:valueString> </epdcx:statement>
- Grant and Funder information: These two fields repeat as needed.
-
<epdcx:statement epdcx:propertyURI="http://www.loc.gov/loc.terms/relators/FND" epdcx:valueRef="funder Arthritis Research UK"/> <epdcx:statement epdcx:propertyURI="http://purl.org/eprint/terms/grantNumber" epdcx:valueRef="grant P30-AR-473639"/>
As with creators, the valueRef attributes are references to descriptions later on; however, if you remove the “grant ” or “funder ” prefix (notice the trailing space character) from the start of the string, what remains is the proper value for that item. The epdcx structure does not relate the funders and the grants; these relationships are defined in later descriptions.
- isExpressedAs: Link to the Expression description.
-
<epdcx:statement epdcx:propertyURI="http://purl.org/eprint/terms/isExpressedAs" epdcx:valueURI="sword-mets-expr-1"/>
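The prefix convention on the grant and funder valueRef attributes can be handled with a few lines. This is a minimal sketch with a hypothetical function name; it only covers the two prefixes described above.

```python
def split_value_ref(value_ref):
    """Split a grant/funder valueRef into (kind, value).

    Returns (None, value_ref) when neither prefix is present, as for
    creator references such as "IanStuart".
    """
    for kind in ("grant", "funder"):
        prefix = kind + " "  # the trailing space is part of the prefix
        if value_ref.startswith(prefix):
            return kind, value_ref[len(prefix):]
    return None, value_ref


print(split_value_ref("funder Arthritis Research UK"))  # ('funder', 'Arthritis Research UK')
print(split_value_ref("grant P30-AR-473639"))           # ('grant', 'P30-AR-473639')
```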
Within the Manifestation section, the broker currently exports:
- type:
-
<epdcx:statement epdcx:propertyURI="http://purl.org/dc/elements/1.1/type" epdcx:vesURI="http://purl.org/eprint/terms/Type" epdcx:valueURI="http://purl.org/eprint/entityType/Manifest"/>
- publication: The title of the journal or site the record was published in.
-
<epdcx:statement epdcx:propertyURI="http://opendepot.org/broker/elements/1.0/publication"> <epdcx:valueString>Acme News</epdcx:valueString> </epdcx:statement>
- issn, isbn, volume, issue, pagerange: these all generally follow the same structure.
-
<epdcx:statement epdcx:propertyURI="http://opendepot.org/broker/elements/1.0/issn"> <epdcx:valueString>12775</epdcx:valueString> </epdcx:statement>
- accessrights: OpenAccess, RestrictedAccess or ClosedAccess
-
ClosedAccess will have no embargoed files; RestrictedAccess documents will have embargo details in the structMap and amdSec sections.
<epdcx:statement epdcx:propertyURI="http://purl.org/dc/elements/1.1/type" epdcx:valueURI="http://purl.org/eprint/accessRights/RestrictedAccess"/>
- isAvailableAs: This is where other copies of the record are available. There are two categories: available-official_url and available-related_url. These elements actually refer to fuller description elements later in the document.
-
<epdcx:statement epdcx:propertyURI="http://purl.org/eprint/terms/isAvailableAs" epdcx:valueRef="available-related_url-2"/> <epdcx:statement epdcx:propertyURI="http://purl.org/eprint/terms/isAvailableAs" epdcx:valueRef="available-related_url-1"/> <epdcx:statement epdcx:propertyURI="http://purl.org/eprint/terms/isAvailableAs" epdcx:valueRef="available-official_url-1"/>
After the main metadata descriptions, the broker lists the various explanation descriptions:
External copies and other versions
Records can have a reference to copies hosted elsewhere, most notably those from the Gold Access suppliers. This is where those links are described.
There are two types of link: Europe PubMed Central provides a link to the official record within its data-set, and then there are related URLs, which are anything else.
Where possible, each set will contain the following:
- the description element will list the actual URL in the resourceURI attribute.
- there will be an accessRights statement.
- a statement listing the site (or organisation) hosting that copy, and
- a statement giving the format of the document that the link refers to (pdf, html, etc.).
<epdcx:description
epdcx:resourceId="available-official_url-1"
epdcx:resourceUrl="http://europepmc.org/articles/PMC3402849">
<epdcx:statement
epdcx:propertyURI="http://purl.org/dc/elements/1.1/type"
epdcx:valueURI="http://purl.org/eprint/entityType/Copy"/>
</epdcx:description>
<epdcx:description
epdcx:resourceId="available-related_url-1"
epdcx:resourceUrl="http://www.pubmedcentral.org/articles/PMC3402849/pdf/?tool=EBI">
<epdcx:statement
epdcx:propertyURI="http://purl.org/dc/elements/1.1/type"
epdcx:valueURI="http://purl.org/eprint/entityType/Copy"/>
<epdcx:statement
epdcx:propertyURI="http://purl.org/dc/terms/accessRights"
epdcx:valueURI="http://purl.org/eprint/accessRights/openAcess">
<epdcx:valueString>Free</epdcx:valueString>
</epdcx:statement>
<epdcx:statement
epdcx:propertyURI="http://opendepot.org/reference/rjb/site">
<epdcx:valueString>PubMedCentral</epdcx:valueString>
</epdcx:statement>
<epdcx:statement
epdcx:propertyURI="http://opendepot.org/reference/rjb/format">
<epdcx:valueString>pdf</epdcx:valueString>
</epdcx:statement>
</epdcx:description>
<epdcx:description
epdcx:resourceId="available-related_url-2"
epdcx:resourceUrl="http://europepmc.org/articles/PMC3402849?pdf=render">
<epdcx:statement
epdcx:propertyURI="http://purl.org/dc/elements/1.1/type"
epdcx:valueURI="http://purl.org/eprint/entityType/Copy"/>
<epdcx:statement
epdcx:propertyURI="http://purl.org/dc/terms/accessRights"
epdcx:valueURI="http://purl.org/eprint/accessRights/openAcess">
<epdcx:valueString>Free</epdcx:valueString>
</epdcx:statement>
<epdcx:statement
epdcx:propertyURI="http://opendepot.org/reference/rjb/site">
<epdcx:valueString>Europe_PMC</epdcx:valueString>
</epdcx:statement>
<epdcx:statement
epdcx:propertyURI="http://opendepot.org/reference/rjb/format">
<epdcx:valueString>pdf</epdcx:valueString>
</epdcx:statement>
</epdcx:description>
Funders and grant codes
The RIOXX schema defines a Funder element and a Grant-Code element, with no actual link between them. The relationship between the funder and the grants is defined in these descriptions. For all descriptions, the resourceId is the reference value defined earlier in the metadata.
- Funder. Note how multiple grants are listed within a single funder, where that is appropriate:
-
<epdcx:description epdcx:resourceId="funder NIAMS NIH HHS"> <epdcx:statement epdcx:propertyURI="http://www.loc.gov/loc.terms/relators/FND"> <epdcx:valueString>NIAMS NIH HHS</epdcx:valueString> </epdcx:statement> <epdcx:statement epdcx:propertyURI="http://purl.org/eprint/terms/grantNumber"> <epdcx:valueString>K23-AR-50177</epdcx:valueString> </epdcx:statement> <epdcx:statement epdcx:propertyURI="http://purl.org/eprint/terms/grantNumber"> <epdcx:valueString>N01-AR-42272</epdcx:valueString> </epdcx:statement> </epdcx:description>
- Grant. Although we've not come across it yet, this description does allow for multiple funders supporting a single grant.
-
<epdcx:description epdcx:resourceId="grant N01-AR-42272"> <epdcx:statement epdcx:propertyURI="http://purl.org/eprint/terms/grantNumber"> <epdcx:valueString>N01-AR-42272</epdcx:valueString> </epdcx:statement> <epdcx:statement epdcx:propertyURI="http://www.loc.gov/loc.terms/relators/FND"> <epdcx:valueString>NIAMS NIH HHS</epdcx:valueString> </epdcx:statement> </epdcx:description> <epdcx:description epdcx:resourceId="grant K23-AR-50177"> <epdcx:statement epdcx:propertyURI="http://purl.org/eprint/terms/grantNumber"> <epdcx:valueString>K23-AR-50177</epdcx:valueString> </epdcx:statement> <epdcx:statement epdcx:propertyURI="http://www.loc.gov/loc.terms/relators/FND"> <epdcx:valueString>NIAMS NIH HHS</epdcx:valueString> </epdcx:statement> </epdcx:description>
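A receiving repository could rebuild the funder-to-grants relationship from the funder descriptions roughly as follows. This is a sketch under assumptions: the sample is wrapped in a descriptionSet so it parses on its own, and the function and helper names are illustrative.

```python
import xml.etree.ElementTree as ET

EPDCX = "http://purl.org/eprint/epdcx/2006-11-16/"
FUNDER = "http://www.loc.gov/loc.terms/relators/FND"
GRANT = "http://purl.org/eprint/terms/grantNumber"

SAMPLE = f"""
<epdcx:descriptionSet xmlns:epdcx="{EPDCX}">
  <epdcx:description epdcx:resourceId="funder NIAMS NIH HHS">
    <epdcx:statement epdcx:propertyURI="{FUNDER}">
      <epdcx:valueString>NIAMS NIH HHS</epdcx:valueString>
    </epdcx:statement>
    <epdcx:statement epdcx:propertyURI="{GRANT}">
      <epdcx:valueString>K23-AR-50177</epdcx:valueString>
    </epdcx:statement>
    <epdcx:statement epdcx:propertyURI="{GRANT}">
      <epdcx:valueString>N01-AR-42272</epdcx:valueString>
    </epdcx:statement>
  </epdcx:description>
</epdcx:descriptionSet>
"""


def q(name):
    """Qualified (Clark-notation) name in the epdcx namespace."""
    return "{%s}%s" % (EPDCX, name)


def grants_by_funder(xml_text):
    """Map each funder name to the list of grant numbers in its description."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for desc in root.iter(q("description")):
        if not desc.get(q("resourceId"), "").startswith("funder "):
            continue  # only funder descriptions are of interest here
        funder, grants = None, []
        for statement in desc.findall(q("statement")):
            value = (statement.findtext(q("valueString")) or "").strip()
            prop = statement.get(q("propertyURI"))
            if prop == FUNDER:
                funder = value
            elif prop == GRANT:
                grants.append(value)
        if funder:
            result[funder] = grants
    return result


print(grants_by_funder(SAMPLE))
# {'NIAMS NIH HHS': ['K23-AR-50177', 'N01-AR-42272']}
```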
Agents
Each Agent is defined in their own description section.
- Note that there is a reference to the creator element in the ScholarlyWork description:
-
<epdcx:description epdcx:resourceID="IanStuart"> <epdcx:statement epdcx:propertyURI="http://purl.org/dc/elements/1.1/Type" epdcx:vesURI="http://purl.org/dc/elements/1.1/Person"/>
- Given name & Family name
-
<epdcx:statement epdcx:propertyURI="http://purl.org/dc/elements/1.1/givenname"> <epdcx:valueString>Ian</epdcx:valueString> </epdcx:statement> <epdcx:statement epdcx:propertyURI="http://purl.org/dc/elements/1.1/familyname"> <epdcx:valueString>Stuart</epdcx:valueString> </epdcx:statement>
- Email address
-
<epdcx:statement epdcx:propertyURI="http://xmlns.com/foaf/0.1/mbox"> <epdcx:valueString>Ian.Stuart@ed.ac.uk</epdcx:valueString> </epdcx:statement>
- Address
-
<epdcx:statement epdcx:propertyURI="http://purl.org/eprint/terms/affiliatedInstitution"> <epdcx:valueString> EDINA, 160 Causewayside, Edinburgh. EH9 1PR. United Kingdom </epdcx:valueString> </epdcx:statement>
- Organisation and orgid code from the ORI service
-
<epdcx:statement epdcx:propertyURI="http://xmlns.com/foaf/0.1/name"> <epdcx:valueString>EDINA</epdcx:valueString> </epdcx:statement> <epdcx:statement epdcx:propertyURI="http://opendepot.org/reference/linked/1.0/identifier"> <epdcx:valueString>3199</epdcx:valueString> </epdcx:statement> </epdcx:description>
amdSec
This is the administrative section of the METS document. Currently it only contains the extended embargo information, using the DCMI Abstract Model.
Sample record:
<amdSec ID="sword-mets-adm-1" LABEL="administrative" TYPE="LOGICAL">
<rightsMD ID="sword-mets-amdRights-1">
<mdWrap MDTYPE="OTHER" OTHERMDTYPE="RJ-BROKER">
<xmlData>
<epdcx:descriptionSet
xmlns:epdcx="http://purl.org/eprint/epdcx/2006-11-16/"
xsi:schemaLocation="http://purl.org/eprint/epdcx/2006-11-16/
http://purl.org/eprint/epdcx/xsd/2006-11-16/epdcx.xsd ">
<epdcx:description epdcx:resourceId="sword-mets-div-3"
epdcx:resourceURI="http://devel.edina.ac.uk:1203/191/">
<epdcx:statement
epdcx:propertyURI="http://purl.org/dc/terms/available"
epdcx:valueRef="http://purl.org/eprint/accessRights/RestrictedAccess">
<epdcx:valueString
epdcx:sesURI="http://purl.org/dc/terms/W3CDTF">
2013-05-29
</epdcx:valueString>
</epdcx:statement>
</epdcx:description>
<epdcx:description epdcx:resourceId="sword-mets-div-4"
epdcx:resourceURI="http://devel.edina.ac.uk:1203/191/">
<epdcx:statement
epdcx:propertyURI="http://purl.org/dc/terms/available"
epdcx:valueRef="http://purl.org/eprint/accessRights/RestrictedAccess">
<epdcx:valueString
epdcx:sesURI="http://purl.org/dc/terms/W3CDTF">
2013-05-29
</epdcx:valueString>
</epdcx:statement>
</epdcx:description>
</epdcx:descriptionSet>
</xmlData>
</mdWrap>
</rightsMD>
</amdSec>
In essence, this defines the availability to be on some date, with a reason of Restricted Access.
Also notice that each description has a resourceId which links it to the appropriate div in the structMap section:
<epdcx:description epdcx:resourceId="sword-mets-div-3"
epdcx:resourceURI="http://devel.edina.ac.uk:1203/191/">
<epdcx:statement
epdcx:propertyURI="http://purl.org/dc/terms/available"
epdcx:valueRef="http://purl.org/eprint/accessRights/RestrictedAccess">
<epdcx:valueString
epdcx:sesURI="http://purl.org/dc/terms/W3CDTF">
2013-05-29
</epdcx:valueString>
</epdcx:statement>
</epdcx:description>
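The availability date in these statements can be turned into a simple embargo test. The sketch below assumes a full yyyy-mm-dd date and assumes the file becomes available on the date itself; neither point is spelled out in this guide.

```python
from datetime import date


def is_embargoed(value_string, today):
    """True while the availability date is still in the future.

    The valueString in the sample is surrounded by whitespace, so it is
    stripped before parsing. Only full yyyy-mm-dd dates are handled.
    """
    available = date.fromisoformat(value_string.strip())
    return today < available


print(is_embargoed("\n  2013-05-29\n", date(2013, 5, 28)))  # True
print(is_embargoed("2013-05-29", date(2013, 5, 29)))        # False
```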
fileSec
This is the METS section that details the files.
Each file has its own record:
<file ID="eprint-191-document-123-0" GROUPID="sword-mets-fgid-123" SIZE="3670383" OWNERID="http://devel.edina.ac.uk:1203/191/" MIMETYPE="application/gif"> <FLocat LOCTYPE="URL" xlink:type="simple" xlink:href="123/Spectator_safety.gif"/> </file>
FLocat gives the location of the file within the .zip archive (e.g. the file “Spectator_safety.gif” within folder “123” in the archive).
Sample record:
<fileSec ID="sword-mets-file-1" LABEL="files">
<fileGrp ID="sword-mets-fgrp-1" USE="CONTENT">
<file ID="eprint-191-document-123-0" GROUPID="sword-mets-fgid-123"
SIZE="3670383" OWNERID="http://devel.edina.ac.uk:1203/191/"
MIMETYPE="application/gif">
<FLocat LOCTYPE="URL" xlink:type="simple"
xlink:href="123/Spectator_safety.gif"/>
</file>
<file ID="eprint-191-document-456-0" GROUPID="sword-mets-fgid-456"
SIZE="109601" OWNERID="http://devel.edina.ac.uk:1203/191/"
MIMETYPE="application/zip">
<FLocat LOCTYPE="URL" xlink:type="simple"
xlink:href="456/Broker_imported.zip"/>
</file>
<file ID="eprint-191-document-789-0" GROUPID="sword-mets-fgid-789"
SIZE="11083" OWNERID="http://devel.edina.ac.uk:1203/191/"
MIMETYPE="application/pdf">
<FLocat LOCTYPE="URL" xlink:type="simple"
xlink:href="789/pdf1.pdf"/>
</file>
<file ID="eprint-191-document-789-1" GROUPID="sword-mets-fgid-789"
SIZE="11278" OWNERID="http://devel.edina.ac.uk:1203/191/"
MIMETYPE="application/pdf">
<FLocat LOCTYPE="URL" xlink:type="simple"
xlink:href="789/pdf2.pdf"/>
</file>
<file ID="eprint-191-document-789-2" GROUPID="sword-mets-fgid-789"
SIZE="11323" OWNERID="http://devel.edina.ac.uk:1203/191/"
MIMETYPE="application/pdf">
<FLocat LOCTYPE="URL" xlink:type="simple"
xlink:href="789/pdf3.pdf"/>
</file>
<file ID="eprint-191-document-789-3" GROUPID="sword-mets-fgid-789"
SIZE="10752" OWNERID="http://devel.edina.ac.uk:1203/191/"
MIMETYPE="text/xml">
<FLocat LOCTYPE="URL" xlink:type="simple"
xlink:href="789/mets.xml"/>
</file>
</fileGrp>
</fileSec>
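Reading the file locations out of a fileSec like the sample above might look like this. It is a sketch: the xlink namespace declaration is added to the fragment so it parses on its own, and elements are matched by local name so the code does not depend on whether the surrounding mets.xml declares a METS namespace.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

SAMPLE = """
<fileSec xmlns:xlink="http://www.w3.org/1999/xlink" ID="sword-mets-file-1" LABEL="files">
  <fileGrp ID="sword-mets-fgrp-1" USE="CONTENT">
    <file ID="eprint-191-document-123-0" GROUPID="sword-mets-fgid-123"
          MIMETYPE="application/gif">
      <FLocat LOCTYPE="URL" xlink:type="simple" xlink:href="123/Spectator_safety.gif"/>
    </file>
    <file ID="eprint-191-document-789-3" GROUPID="sword-mets-fgid-789"
          MIMETYPE="text/xml">
      <FLocat LOCTYPE="URL" xlink:type="simple" xlink:href="789/mets.xml"/>
    </file>
  </fileGrp>
</fileSec>
"""


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def file_locations(xml_text):
    """Map each file ID to its path inside the .zip archive."""
    root = ET.fromstring(xml_text.strip())
    locations = {}
    for element in root.iter():
        if local(element.tag) != "file":
            continue
        for child in element:
            if local(child.tag) == "FLocat":
                locations[element.get("ID")] = child.get(XLINK_HREF)
    return locations


print(file_locations(SAMPLE))
```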
structMap
This is the section that shows how the files relate to each other.
As mentioned above, a document may consist of multiple files, so each document is kept in a separate directory. Embargoes are applied at document level, so the embargo date is given as an attribute of the document div.
Sample record:
<structMap ID="sword-mets-struct-1" LABEL="structure" TYPE="LOGICAL">
<div ID="sword-mets-div-1" DMDID="sword-mets-dmd-eprint-191"
TYPE="SWORD Object">
<div ID="sword-mets-div-2">
<fptr FILEID="eprint-191-document-123-0"/>
</div>
<div ID="sword-mets-div-3" oarj_embargo="2013-05-29">
<fptr FILEID="eprint-191-document-456-0"/>
</div>
<div ID="sword-mets-div-4" oarj_embargo="2013-05-29">
<fptr FILEID="eprint-191-document-789-3"/>
<fptr FILEID="eprint-191-document-789-0"/>
<fptr FILEID="eprint-191-document-789-1"/>
<fptr FILEID="eprint-191-document-789-2"/>
</div>
</div>
</structMap>
This sample shows three useful things:
- The first document, sword-mets-div-2, is not embargoed: the other two are. This means the record will be “Restricted Access”.
- The third document, sword-mets-div-4, has four files whose order is significant, so eprint-191-document-789-3 is considered the primary document.
- The FILEID attribute of each fptr element refers to the ID attribute of the appropriate file element in the fileSec section.
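The structMap sample can be walked as in the following sketch, which lists each document div with its embargo date and its files in order. The un-namespaced fragment and the dictionary keys are assumptions for illustration; a real mets.xml may declare a namespace that the lookups would need to account for.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<structMap ID="sword-mets-struct-1" LABEL="structure" TYPE="LOGICAL">
  <div ID="sword-mets-div-1" DMDID="sword-mets-dmd-eprint-191" TYPE="SWORD Object">
    <div ID="sword-mets-div-2">
      <fptr FILEID="eprint-191-document-123-0"/>
    </div>
    <div ID="sword-mets-div-3" oarj_embargo="2013-05-29">
      <fptr FILEID="eprint-191-document-456-0"/>
    </div>
    <div ID="sword-mets-div-4" oarj_embargo="2013-05-29">
      <fptr FILEID="eprint-191-document-789-3"/>
      <fptr FILEID="eprint-191-document-789-0"/>
      <fptr FILEID="eprint-191-document-789-1"/>
      <fptr FILEID="eprint-191-document-789-2"/>
    </div>
  </div>
</structMap>
"""


def documents(xml_text):
    """Return one dict per document div, keeping the file order."""
    root = ET.fromstring(xml_text.strip())
    outer = root.find("div")  # the div describing the whole object
    result = []
    for div in outer.findall("div"):
        files = [fptr.get("FILEID") for fptr in div.findall("fptr")]
        result.append({
            "id": div.get("ID"),
            "embargo": div.get("oarj_embargo"),  # None when not embargoed
            "primary": files[0] if files else None,
            "files": files,
        })
    return result


for doc in documents(SAMPLE):
    print(doc["id"], doc["embargo"], doc["primary"])
```

Each FILEID printed here can then be looked up among the file elements of the fileSec to find the path inside the archive.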