ICAT - A pipeline to generate (and annotate) contracts datasets.

Datasets, Legal Tech, AI, Currently working · 29 Jun 2024

Table of Contents:

How it began?
Why collaboration?
The full feature set in development
Roadmap for the future?

How it began?

When I began venturing into Legal Tech development, I knew that publishers have a strong hold over vast troves of legal data. Be it forms or precedents, judgments or statutes, legal information has not been democratised the way it should have been (at least for a field that depends so heavily on all parties knowing "the law").

At the same time, I noticed that a large swathe of data is in the public domain and can be found in courts' judgments. These serve a dual purpose: they help establish whether the matter discussed in the judgment is legally tenable, and they contain the kind of text from which such data can be built.

Collecting data for AI/ML use cases is particularly important at a time when everyone is racing to build a Legal Tech solution that, in nearly all cases, relies on data.

I figured that a data extraction pipeline that collects this public-domain data, and demonstrates how this can be done openly, would be beneficial. Hence, ICAT.

Why collaboration?

The ICAT pipeline is an interactive process that identifies user needs.
Because learning is a two-way process, I am collecting responses to see what people need and, in return, giving them free inference on a model trained on an annotated dataset.
I hope that, as time passes, more people will add their queries and give insights or feedback on what they expect to be annotated.

The full feature set in development

1. Firstly, the dataset is available to anyone who wishes to train on it or run inference with it for their own uses. It is shared under the Creative Commons Attribution-ShareAlike 4.0 licence, under which you are required to share, on similar terms, any application you develop using it.
Version 1 of this dataset is currently hosted on Hugging Face at the following link - https://huggingface.co/datasets/schematise/ICAT-version1

2. There is a public Hugging Face Space with an option to run a query pipeline that adds the results (anonymously) to a relational database. You can even download this Space and see what the automated pipeline for collecting data looks like.

3. If you like the dataset and/or the query pipeline, you can view the data analytics platform hosted on Deepnote. This is essentially a notebook to which I will periodically add insights for public viewing.
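
To make the second item more concrete, here is a minimal sketch of how a query pipeline could log its results anonymously to a relational database. It is not the code running in the Space: SQLite stands in for whatever database is actually used, and the table names, column names, and the functions open_db and log_query are all hypothetical, chosen only for illustration.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# Hypothetical schema: the real tables behind the Space are not described here.
SCHEMA = """
CREATE TABLE IF NOT EXISTS queries (
    id TEXT PRIMARY KEY,          -- random ID, not tied to any user
    submitted_at TEXT NOT NULL,   -- UTC timestamp in ISO 8601 format
    query_text TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    query_id TEXT NOT NULL REFERENCES queries(id),
    clause_text TEXT NOT NULL,
    label TEXT                    -- optional annotation for the clause
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def log_query(conn, query_text, results):
    """Store a query and its (clause_text, label) results.

    No user name, e-mail, or IP address is recorded; the only key is a
    random identifier generated for this one query.
    """
    query_id = uuid.uuid4().hex
    submitted_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO queries (id, submitted_at, query_text) VALUES (?, ?, ?)",
            (query_id, submitted_at, query_text),
        )
        conn.executemany(
            "INSERT INTO results (query_id, clause_text, label) VALUES (?, ?, ?)",
            [(query_id, text, label) for text, label in results],
        )
    return query_id
```

Calling log_query(open_db(), "indemnity clause", [("The supplier shall indemnify the buyer.", "indemnity")]) would store one query row and one result row. Keeping queries and results in separate tables means later annotations can be attached to individual clauses without touching the original query.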

Roadmap for the future?

Version 1 is a sample of what expert annotation can look like for business cases that call for more detailed specification. To this end, I will be gauging the response and publishing a contracts dataset as far as possible.
Contract similarity search over the eventually aggregated and annotated database.
Adding annotations such as sentiment and ontological mapping, and the ability to chat with the dataset. For example, portions of contractual clauses could be segregated on the basis of the legal predicates that govern their legality, using an ontology. Converting documents to such an ontology in an automated manner is something I’m working on in my other projects.
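
As a rough illustration of the contract similarity search on the roadmap, here is a small sketch that ranks clauses against a query using TF-IDF weights and cosine similarity. This is a stand-in technique chosen because it needs no external libraries; the method ICAT will actually use has not been decided here, and an embedding-based search may well fit better. The function names and the sample clauses in the usage note are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def weight(tokens, idf):
    """Turn a token list into a sparse TF-IDF vector (term -> weight)."""
    tf = Counter(tokens)
    # Terms never seen in the indexed clauses are ignored.
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def build_index(clauses):
    """Return (idf, vectors) for a list of clause strings."""
    tokenized = [tokenize(clause) for clause in clauses]
    n = len(clauses)
    # Document frequency: in how many clauses each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so that every term keeps a positive weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [weight(tokens, idf) for tokens in tokenized]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, clauses, idf, vectors, top_k=3):
    """Return the top_k (score, clause) pairs most similar to the query."""
    q = weight(tokenize(query), idf)
    scored = sorted(
        ((cosine(q, vec), clause) for vec, clause in zip(vectors, clauses)),
        reverse=True,
    )
    return scored[:top_k]
```

Given a few clauses (say, one on indemnity, one on governing law, and one on termination by notice), build_index is called once and search then ranks the clauses for each query, so a query about termination notice should surface the termination clause first. Only word overlap is measured, so clauses that say the same thing in different words will score poorly.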