Collecting data plays a big role in Legal-Tech today!

Ideas for creating and utilising legal datasets

To go directly to a list of use-cases I’m in the process of developing, click here

The benefits of copyleft licensing, and where to find sources for legal datasets

What are the benefits of publishing a legal dataset with a Creative Commons license? It allows people to utilise your content as per your terms, and also increases the visibility of your content.

My law professors, and indeed many others in academia, have deliberated on the internet’s tendency to produce memes and other forms of “derivative” content regardless of the license the original carries. Some platforms are increasingly adopting Digital Restrictions Management, but that in no way guarantees that content you prepared to be shared and accessed widely will earn you any revenue. You need to set your own terms for the use of your work, whether for “derivative works”, trademark usage, or even outright commercial use.

With that established, we can also use datasets (or even raw data) that occur in the wild under a permissive license. So, here’s a list of some sources I found helpful, starting with what Section 52(1)(q) of the Copyright Act, 1957 permits:

Section 52. *Certain acts not to be infringement of copyright*
(1) The following acts shall not constitute an infringement of copyright, namely,--
...
(q) the reproduction or publication of--
(i) any matter which has been published in any Official Gazette except an Act of a Legislature;
(ii) any Act of a Legislature subject to the condition that such Act is reproduced or published together with any commentary thereon or any other original matter;
(iii) the report of any committee, commission, council, board or other like body appointed by the Government if such report has been laid on the Table of the Legislature, unless the reproduction or publication of such report is prohibited by the Government;
(iv) any judgment or order of a court, tribunal or other judicial authority, unless the reproduction or publication of such judgment or order is prohibited by the court, the tribunal or other judicial authority, as the case may be;

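What this means is that matter obtained from the eGazette, which has replaced publication of the Gazette in printed form, can be reproduced without infringing copyright. The exception is the text of an Act of a Legislature, which under sub-clause (ii) has to be reproduced together with commentary or other original matter.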

Another rich data source is the eCourts platform, which provides judgments and orders in PDF form. You will need some knowledge of how case details are processed to navigate this specific portal, but I think that’s a task for a separate post altogether.

Use-cases which I am working on currently, or plan to incorporate further

  1. I’ve created a unique tool for extracting specific clauses from judgment documents in PDF format. My next step is to make this data publicly available, following the completion of my detailed clause annotation process.

  2. I’m currently employing a blend of OpenNyAI and similar ML models to work on a set of judgments.

  3. An interesting project I believe has academic potential involves harvesting data from eCourts and various other APIs. This data serves as the foundation for topic modeling and other, more sophisticated analyses in NLP and data science.

  4. As I think an easily accessible resource for government notifications is necessary, I’ve developed and shared a preliminary version of a website that serves as a directory for some of these. You can explore this initial version here: https://env-law-notification-browser.onrender.com/browse. This one will require me to go back to the original source for each notification, as the government websites that host them often have restrictions on hyperlinking.

  5. For government notifications that undergo constant change, the Indigo project offers software to implement point-in-time classification. While not a legal dataset in strict terms, it depends on the Akoma Ntoso (or LegalXML) format for exchanging data, which I think largely conforms to the philosophy of sharing data on legislation.
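
To make the point-in-time idea in the last item concrete, here is a minimal sketch in Python. It is not how Indigo implements it, and it does not use Akoma Ntoso; the `Version` class, the `version_on` function and the sample history are all hypothetical names and made-up data, assuming only that each version of a notification carries the date from which it is in force.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Version:
    """One dated text of a notification (hypothetical structure)."""
    in_force_from: date
    text: str


def version_on(versions: list[Version], when: date) -> Optional[Version]:
    """Return the version in force on `when`, or None if none had commenced."""
    # Sort by commencement date so a binary search can be used.
    ordered = sorted(versions, key=lambda v: v.in_force_from)
    dates = [v.in_force_from for v in ordered]
    # bisect_right counts the versions that commenced on or before `when`.
    idx = bisect_right(dates, when)
    return ordered[idx - 1] if idx else None


# Made-up sample data, for illustration only.
history = [
    Version(date(2020, 1, 1), "Original text"),
    Version(date(2021, 6, 15), "Text as first amended"),
    Version(date(2023, 3, 1), "Text as second amended"),
]

print(version_on(history, date(2022, 1, 1)).text)  # prints "Text as first amended"
```

Storing every dated version and looking up the latest one that commenced on or before the query date keeps older texts retrievable, which matters when a dispute turns on the wording that applied at a particular time.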