trusthlt/lacour-corpus

LaCour! Corpus

Companion dataset to the arXiv preprint presenting the LaCour! corpus.

Please use the following citation

@article{held2023lacour,
    author = {Held, Lena and Habernal, Ivan},
    title = {{LaCour!: Enabling Research on Argumentation in Hearings of the European Court of Human Rights}},
    journal = {arXiv preprint},
    year = {2023},
    doi = {10.48550/arXiv.2312.05061},
}

Abstract Why does an argument end up in the final court decision? Was it deliberated or questioned during the oral hearings? Was there something in the hearings that triggered a particular judge to write a dissenting opinion? Despite the availability of the final judgments of the European Court of Human Rights (ECHR), none of these legal research questions can currently be answered as the ECHR's multilingual oral hearings are not transcribed, structured, or speaker-attributed. We address this fundamental gap by presenting LaCour!, the first corpus of textual oral arguments of the ECHR, consisting of 154 full hearings (2.1 million tokens from over 267 hours of video footage) in English, French, and other court languages, each linked to the corresponding final judgment documents. In addition to the transcribed and partially manually corrected text from the video, we provide sentence-level timestamps and manually annotated role and language labels. We also showcase LaCour! in a set of preliminary experiments that explore the interplay between questions and dissenting opinions. Apart from the use cases in legal NLP, we hope that law students or other interested parties will also use LaCour! as a learning resource, as it is freely available in various formats at https://huggingface.co/datasets/TrustHLT/LaCour.

Contact person: Lena Held, [email protected]

tl;dr

📖 Reading some ECHR hearing transcripts? LaCour! Preview
🤗 Convenient dataset usage: Huggingface Dataset
🔽 Download the individual transcript files: .txt .xml
🔽 Download the document metadata: documents
👩‍💻 Creation code for reproduction: trusthlt/lacour-generation
⁉️ Questions and opinions dataset: trusthlt/lacour-qando

Data

The dataset consists of 2 subsets.

Subset transcripts

The first subset, transcripts, contains the 154 transcripts of court hearings. It is provided in two formats, .xml and .txt. Both formats contain the same text and information.

Files in .txt format have the following structure:

[[Announcer;UNK]]

<<22.32;23.16;fr>>
La Cour!

[[]] denotes a speaker segment and holds the Role and Name of the speaker. <<>> marks a snippet and holds its begin timestamp, end timestamp and language tag; the spoken text follows on the next line.
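As an illustration of this structure, the following is a minimal parsing sketch. It is not the loader shipped in load_lacour.py: the function name `parse_txt_transcript` and both regular expressions are assumptions inferred from the single example above, so real files may need a more tolerant grammar.

```python
import re

# Patterns inferred from the example above; they are assumptions,
# not an official grammar of the .txt format.
SPEAKER_RE = re.compile(r"^\[\[(?P<role>[^;\]]*);(?P<name>[^\]]*)\]\]$")
SNIPPET_RE = re.compile(r"^<<(?P<begin>[\d.]+);(?P<end>[\d.]+);(?P<lang>[^>]*)>>$")


def parse_txt_transcript(text):
    """Return one dict per snippet with role, name, begin, end, language, text."""
    rows = []
    role = name = None
    current = None  # the snippet that is currently collecting text lines
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        speaker = SPEAKER_RE.match(line)
        snippet = SNIPPET_RE.match(line)
        if speaker:
            # A new speaker segment starts; later snippets inherit role and name.
            role, name = speaker["role"], speaker["name"]
            current = None
        elif snippet:
            current = {
                "role": role,
                "name": name,
                "begin": float(snippet["begin"]),
                "end": float(snippet["end"]),
                "language": snippet["lang"],
                "text": "",
            }
            rows.append(current)
        elif current is not None:
            # Plain text belongs to the most recent snippet header.
            current["text"] = (current["text"] + " " + line).strip()
    return rows


sample = "[[Announcer;UNK]]\n\n<<22.32;23.16;fr>>\nLa Cour!\n"
rows = parse_txt_transcript(sample)
print(rows)
```

Each snippet becomes one flat row, which keeps the result easy to pass to pandas.DataFrame.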

Files in .xml format have the following structure:

<?xml version='1.0' encoding='utf-8'?>
<Transcript>
  <WebcastID>2438419_29092021</WebcastID>
  <SpeakerSegment>
    <Role>Announcer</Role>
    <Name>UNK</Name>
    <Snippet>
      <Language>fr</Language>
      <TimestampBegin>16.5</TimestampBegin>
      <TimestampEnd>17.1</TimestampEnd>
      <Text>La Cour!</Text>
    </Snippet>
    ...
  </SpeakerSegment>
  ...
</Transcript>

We provide this nested format to make potential annotation tasks easier.
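The nested structure can also be walked with Python's standard library. The sketch below is an assumption-laden illustration, not the repository's loader: `flatten_transcript` is a hypothetical helper, and the inline sample reuses only the elements shown in the example above.

```python
import xml.etree.ElementTree as ET

# Inline sample built from the example above (XML declaration omitted).
sample_xml = """<Transcript>
  <WebcastID>2438419_29092021</WebcastID>
  <SpeakerSegment>
    <Role>Announcer</Role>
    <Name>UNK</Name>
    <Snippet>
      <Language>fr</Language>
      <TimestampBegin>16.5</TimestampBegin>
      <TimestampEnd>17.1</TimestampEnd>
      <Text>La Cour!</Text>
    </Snippet>
  </SpeakerSegment>
</Transcript>"""


def flatten_transcript(root):
    """Turn the nested SpeakerSegment/Snippet tree into one dict per snippet."""
    webcast_id = root.findtext("WebcastID")
    rows = []
    for segment in root.iter("SpeakerSegment"):
        role = segment.findtext("Role")
        name = segment.findtext("Name")
        for snippet in segment.iter("Snippet"):
            rows.append({
                "webcast_id": webcast_id,
                "role": role,
                "name": name,
                "language": snippet.findtext("Language"),
                "begin": float(snippet.findtext("TimestampBegin")),
                "end": float(snippet.findtext("TimestampEnd")),
                "text": snippet.findtext("Text"),
            })
    return rows


rows = flatten_transcript(ET.fromstring(sample_xml))
print(rows)
```

For a file on disk, `ET.parse(path).getroot()` would supply the root element instead of `ET.fromstring`.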

Both file formats contain the following information:

  • webcast_id: the identifier for the hearing (allows linking to documents)
  • Role: the role/party the speaker represents (Announcer for announcements, Judge for judges, JudgeP for judge president, Applicant for representatives of the applicant, Government for representatives of the respondent government, ThirdParty for representatives of third party interveners)
  • Name: the name of the speaker (not given for Applicant, Government or ThirdParty)
  • Begin: the timestamp at which the line begins (in seconds)
  • End: the timestamp at which the line ends (in seconds)
  • Language: the language spoken (in ISO 639-1)
  • text: the spoken line
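Because every line carries Begin and End timestamps in seconds, speaking time can be summed per role or per language. The sketch below assumes rows stored as dictionaries keyed by the field names listed above; the three rows are made-up illustrative values, not data from the corpus, and `speaking_time_by` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up rows for illustration only; keys follow the field list above.
rows = [
    {"Role": "Announcer", "Begin": 16.5, "End": 17.1, "Language": "fr"},
    {"Role": "JudgeP", "Begin": 20.0, "End": 25.5, "Language": "en"},
    {"Role": "JudgeP", "Begin": 26.0, "End": 30.0, "Language": "fr"},
]


def speaking_time_by(rows, key):
    """Sum the duration (End - Begin, in seconds) of all rows sharing row[key]."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["End"] - row["Begin"]
    return dict(totals)


by_role = speaking_time_by(rows, "Role")
by_language = speaking_time_by(rows, "Language")
print(by_role)
print(by_language)
```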

Subset documents

The second subset, documents, contains information on all relevant documents in the HUDOC database that are linked to a webcast hearing. The link is established through the application number shared by the hearing and the case. To link transcripts with these documents, use the webcast_id. Each instance in documents describes one HUDOC document associated with a hearing, together with the metadata of that hearing. Note: hearing_type states the type of the hearing, whereas type states the type of the document. If the hearing is a "Grand Chamber hearing", the "CHAMBER" document refers to a different hearing.

 '4': {
    'webcast_id': '2438419_29092021',
    'hearing_date': '2021-09-29 00:00:00',
    'hearing_title': 'H.F. and M.F. v. France and J.D. and A.D. v. France (nos. 24384/19 and 44234/20)',
    'hearing_type': 'Grand Chamber hearing',
    'appno': '44234/20',
    'case_id': '001-219333',
    'case_name': 'CASE OF H.F. AND OTHERS v. FRANCE',
    'case_url': 'https://hudoc.echr.coe.int/eng?i=001-219333',
    'type': 'GRANDCHAMBER',
    'typedescription': 15,
    'document_date': '2022-09-14 00:00:00',
    'collection': 'CASELAW;JUDGMENTS;GRANDCHAMBER;ENG',
    'importance': 1,
    'court': '8',
    'issue': 'Inter-ministerial instruction no. 5995/SG of 23 February 2018 on “Provisions to be made for minors on their return from areas of terrorist group operations (in particular the Syria-Iraq border area)”',
    'represented_by': 'DOSÉ M.',
    'respondent': 'FRA',
    'articles': '1;34;35;35-3-a;41;46;46-2;P4-3;P4-3-2',
    'strasbourg_caselaw': 'Abdi Ibrahim v. Norway [GC], no. 15379/16, § 180, 10 December 2021;Abdul Wahab Khan v. the United Kingdom (dec.), no. 11987/11, §§ 27-28, 28 January 2014;Airey v. Ireland, 9 October 1979, §§ 24-25, Series A no. 32;Al-Dulimi and Montana Management Inc. v. Switzerland [GC], no. 5809/08, §§ 134 and 145-146, 21 June 2016;[...]',
    'external_sources': 'Article 12 § 4 of the International Covenant on Civil and Political Rights (ICCPR);United Nations Human Rights Committee’s (UNCCPR) General Comment no. 27 on the Freedom of Movement under Article 12 of the ICCPR, adopted on 1 November 1999 (UN Documents CCPR/C/21/Rev.1/Add.9);Article 19 of the International Law Commission (ILC) Draft Articles on Diplomatic Protection and commentary;[...]',
    'conclusion': 'Preliminary objection dismissed (Art. 34) Individual applications;(Art. 34) Locus standi;Remainder inadmissible (Art. 35) Admissibility criteria;(Art. 35-3-a) Ratione loci;(Art. 35-3-a) Ratione personae;Violation of Article 3 of Protocol No. 4 - Prohibition of expulsion of nationals (Article 3 para. 2 of Protocol No. 4 - Enter own country);Respondent State to take individual measures (Article 46-2 - Individual measures);Non-pecuniary damage - finding of violation sufficient (Article 41 - Non-pecuniary damage;Just satisfaction)',
    'separate_opinion': 'TRUE',
    'judges': "Ganna Yudkivska;Jon Fridrik Kjølbro;Krzysztof Wojtyczek;Mārtiņš Mits;Robert Spano;Síofra O'Leary;Stéphanie Mourou-Vikström;Yonko Grozev;Georges Ravarani;Ksenija Turković;Lorraine Schembri Orland",
    'ecli': 'ECLI:CE:ECHR:2022:0914JUD002438419'
    }

The fields in documents are:

  • id: the identifier
  • webcast_id: the identifier for the hearing (allows linking to transcripts)
  • hearing_date: the date of the hearing
  • hearing_title: the title of the hearing
  • hearing_type: the type of hearing (Grand Chamber, Chamber or Grand Chamber Judgment Hearing)
  • appno: the application number which is associated with the hearing and case
  • case_id: the id of the case
  • case_name: the name of the case
  • case_url: the direct link to the document
  • type: the type of the document
  • typedescription: the exact identifier of the document type (distinction between e.g. Merits and Just Satisfaction, no key provided)
  • document_date: the date of the document
  • collection: the categorization of the document, i.e. type of document, type of chamber, language
  • importance: the importance score of the case (1 is the highest importance, key case)
  • court: the identifier for the court that issued the document
  • issue: the references to the issue of the case
  • represented_by: the person(s) representing the applicant(s)
  • respondent: the code of the respondent government(s) (in ISO-3166 Alpha-3)
  • articles: the articles of the European Convention on Human Rights that the case concerns
  • strasbourg_caselaw: the list of cases in the ECHR which are relevant to the current case
  • external_sources: the relevant references outside of the ECHR
  • conclusion: the short textual description of the conclusion
  • separate_opinion: indicates whether there is a separate opinion
  • judges: the judges appearing in the associated document
  • ecli: the ECLI (European Case Law Identifier)
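Several of these fields arrive as plain strings. Judging from the example record above, articles and judges appear to be semicolon-separated, separate_opinion is the string 'TRUE', and dates carry a midnight time component. The sketch below normalises such values under those assumptions; `split_field` is a hypothetical helper, and the record is an excerpt of the example. Note that conclusion also contains semicolons inside parentheses in the example, so a naive split may be unreliable for that field.

```python
from datetime import datetime


def split_field(value):
    """Split a semicolon-separated string into a list, dropping empty parts."""
    return [part.strip() for part in value.split(";") if part.strip()]


# Excerpt of the example record shown above.
record = {
    "articles": "1;34;35;35-3-a;41;46;46-2;P4-3;P4-3-2",
    "separate_opinion": "TRUE",
    "document_date": "2022-09-14 00:00:00",
}

articles = split_field(record["articles"])
# Assumption: the flag is stored as the string 'TRUE', as in the example.
has_separate_opinion = record["separate_opinion"] == "TRUE"
document_date = datetime.strptime(record["document_date"], "%Y-%m-%d %H:%M:%S")

print(articles)
print(has_separate_opinion, document_date.date())
```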

Usage

Loading transcripts

XML

The xml format is nested and can be loaded e.g. with the function provided in load_lacour.py.

from load_lacour import load_transcript
from glob import glob
import pandas as pd

transcripts = []
for tf in glob('transcripts-xml/*.xml'):
    # load_transcript returns two values; only the first (t) is collected here
    t, w = load_transcript(tf, format='xml')
    transcripts += t

df = pd.DataFrame(transcripts)

TXT

To load the txt files, you can use load_lacour.py:

from load_lacour import load_transcript
from glob import glob
import pandas as pd

transcripts = []
for tf in glob('transcripts-txt/*.txt'):
    # load_transcript returns two values; only the first (t) is collected here
    t, w = load_transcript(tf)
    transcripts += t

df = pd.DataFrame(transcripts)

Loading document metadata

Load the .json file, e.g.

import pandas as pd
df = pd.read_json('lacour_linked_documents.json', orient='index', dtype={'webcast_id':str})

or

import json
with open('lacour_linked_documents.json') as f:
    d = json.load(f)
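Once loaded, the records can be grouped by webcast_id to find all documents linked to one hearing. The sketch below uses a toy stand-in for `d`: the first record is an excerpt of the example above, the second is made up to illustrate the note on "CHAMBER" documents. The helpers `group_by_webcast` and `same_hearing_documents` are hypothetical, and the filter rule is one reading of that note, not a confirmed procedure.

```python
from collections import defaultdict

# Toy stand-in for the dict loaded from lacour_linked_documents.json.
# The second record is hypothetical and exists only to show the filter.
d = {
    "4": {"webcast_id": "2438419_29092021", "hearing_type": "Grand Chamber hearing",
          "type": "GRANDCHAMBER", "case_id": "001-219333"},
    "5": {"webcast_id": "2438419_29092021", "hearing_type": "Grand Chamber hearing",
          "type": "CHAMBER", "case_id": "hypothetical-chamber-doc"},
}


def group_by_webcast(documents):
    """Map each webcast_id to the list of document records linked to it."""
    grouped = defaultdict(list)
    for doc_id, doc in documents.items():
        grouped[doc["webcast_id"]].append({"id": doc_id, **doc})
    return dict(grouped)


def same_hearing_documents(docs):
    """Drop CHAMBER documents that are linked to a Grand Chamber hearing."""
    return [
        doc for doc in docs
        if not (doc["hearing_type"] == "Grand Chamber hearing" and doc["type"] == "CHAMBER")
    ]


docs_by_webcast = group_by_webcast(d)
linked = same_hearing_documents(docs_by_webcast["2438419_29092021"])
print([doc["case_id"] for doc in linked])
```

The same webcast_id appears in each transcript, so the grouped dictionary can serve as a lookup when iterating over transcripts.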

Questions and Opinions

The companion dataset for the experimental part using questions asked during the hearings and dissenting or concurring opinions can be found in the repository trusthlt/lacour-qando.

Data creation

Companion code for the creation of this dataset is available in the repository trusthlt/lacour-generation.