How chains are mining a ‘gold rush’ of customer data

Your computers are filled with gold, and the 21st-century gold rush is on. In the 1800s it was actual gold buried in California, and in the 1900s it was “black gold,” or oil, buried in Texas and Alaska. In the 21st century, the “gold” is data. Major retailers, search-engine companies, and social media companies have learned that capturing and using data is key to their success. Data—information buried in the text of legal documents—is becoming as important to the practice of law as it is to these other businesses. You don’t need to be a legal data scientist, but you do need to help mine what you create to find the treasures that will help your clients.

The Rush Is On

On January 24, 1848, James Wilson Marshall was building a water-powered sawmill for John Sutter near Coloma, California. Marshall wasn’t looking for gold, but he later claimed he knew what he had found as soon as he saw the flakes.

Gold mining peaked in 1852 when miners pulled approximately $81 million (in 1852 dollars) of gold from the ground, and the non-native population had increased to over 100,000. Over the next few years, annual production dropped, and the Gold Rush was over. The Gold Rush was important not only because of the wealth it created, but also because it reshaped the United States in the 19th century much like hydrocarbons reshaped it in the 20th century. The “gold” reshaping the 21st century sits in your computers and we call it data, but lawyers are ignoring this precious commodity.

Lawyers Are In No Rush

Each day the world generates and stores more data than all the data existing in all of recorded history prior to the computer age. Large corporations operating as retailers, search engines, or social media companies have found that the value of their businesses is as much in the data they gather as in the goods and services they sell. In Silicon Valley, what your business does is almost irrelevant; you create value through data.

Then we come to the legal industry. We can tell two versions of the legal industry’s story. The first version goes like this:

Recognizing the threat to clients from cybersecurity breaches, law firms settled on an unusual security approach. They accepted the futility of trying to build hack-proof firewalls and focused instead on what was in the computers, organizing it much like teenage boys arrange their rooms. As one law-firm leader said, “We decided that if our data was a mess and even we, who know it well, have difficulty finding and doing anything with it, then hackers would have more trouble and simply give up.”

To test the idea, a team of red-hat associates was tasked with hacking a law firm’s computers to find documents to use as drafting templates. The blue-hat defense team’s strategy was simple: “We pretended we were partners and randomly withheld helpful information from the red-hat team.” The red-hat team gave up after a few hours and drafted the necessary documents from scratch.

The second version goes like this:

While clients and the world around them screamed about data, lawyers continued their pursuit of obsolescence. Lawyers knew that, having not enjoyed knowledge management, they would enjoy data management even less. Adopting the “to do nothing is to do something” approach, lawyers treated their documents as dirt, not gold.

When asked about this strategy, a lawyer responded, “The world around us has been changing for decades and yet here we sit today, almost unchanged. To respond to this ‘data fad’ by doing something would go against our history of ignoring technology. Indeed, we are so worried about this trend of using computers that we may ask bar associations to file claims against computer companies, arguing that their products are used in the unauthorized practice of law.”

Choose which version you prefer. Either way, lawyers have not created and stored legal documents as if the data they contain is valuable.

Show Me the Data

What data is in legal documents? There are many types, but we will focus on the data trapped in text. This is the data that helps solve client problems, which means we exclude hours billed and fees.

We can begin with a question often heard in negotiations. Your client wants an earn-out clause in the contract, the other party resists, and your client asks, “What is market?” You poll your partners, combine their responses with your experience, and answer, “Earn-out clauses are standard.” The opposing party’s lawyer responds, “I’ve been doing this for 30 years and I’ve never seen an earn-out clause in this type of contract.” Stalemate.

Trapped in your files (and in the other lawyer’s files and your client’s files) are better answers. Those files hold data: what clauses were used, what notice periods were given, what liability triggers were accepted, and so on. Each contract has dozens, perhaps hundreds, of useful data points, but you didn’t treat those contracts as if they contained data, so you can’t get to it.
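To make the idea of contract data points concrete, here is a minimal sketch in Python of how the “what is market” question could be answered from a firm’s own files. Everything in it is an assumption made for illustration: the three sample contracts, the two regular expressions, and the function names are hypothetical, and real contract language would need far more careful extraction.

```python
import re

# Hypothetical mini-corpus: contract id -> text. Real files would be loaded from storage.
CONTRACTS = {
    "deal_001": "The Buyer shall pay an earn-out based on revenue. Notice period: 30 days.",
    "deal_002": "No contingent consideration applies. Notice period: 60 days.",
    "deal_003": "Seller is entitled to an Earn-Out Payment. Notice period: 30 days.",
}

# Illustrative patterns only; real drafting varies far more than this.
EARN_OUT = re.compile(r"\bearn[- ]?out\b", re.IGNORECASE)
NOTICE = re.compile(r"notice period:\s*(\d+)\s*days", re.IGNORECASE)


def extract_data_points(text):
    """Pull two illustrative data points out of one contract's text."""
    notice = NOTICE.search(text)
    return {
        "has_earn_out": bool(EARN_OUT.search(text)),
        "notice_days": int(notice.group(1)) if notice else None,
    }


def earn_out_share(contracts):
    """Fraction of contracts in the corpus that mention an earn-out."""
    rows = [extract_data_points(text) for text in contracts.values()]
    if not rows:
        return 0.0
    return sum(row["has_earn_out"] for row in rows) / len(rows)


if __name__ == "__main__":
    print(f"Earn-out appears in {earn_out_share(CONTRACTS):.0%} of sample contracts")
```

With a few hundred real contracts behind it, a count like this would give both sides of the negotiation something better than memory to argue from.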

Over the past 20 years, Nobel Prize winner Daniel Kahneman and other psychologists and behavioral economists have taught us that human decision making is burdened with many rules (scholars call them “heuristics”). These rules guide our behavior and how we answer questions from the simple to complex. We use heuristics to get through a very messy world without getting bogged down. Heuristics help us quickly answer the “what is market” question.

Unfortunately, heuristics often lead us in the wrong direction in complex situations. They sacrifice obtaining and analyzing data for speed. You may have been wrong when you said earn-out provisions are standard (after all, you were relying on human memories), but data would have told you the correct answer. You didn’t take the time to check the data, in part because you couldn’t get to it quickly. Lawyers have skipped data, relied on heuristics, and hoped for the best. Publishers such as the Practical Law Company are attempting to fill some of these gaps, but their answers are based on their data, not your data.

The Stories Data Could Tell

Lawyers’ practices tell stories of risks and responses. What was the story of the deal? What were the terms? Why did the clients focus on certain clauses? What were the main risks, and what were the responses? Look at any interaction between a lawyer and a client and it leads to a story rich in details and data, yet like the folklore of some aging tribe, those stories fade away over time.

Lawyers have never really thought about the record of their services as data. The documents stored on computers are simply digital forms of the paper files. The who, what, where, when, and how may be there, but lawyers still believe that humans, not computers, are the ones who can decode the documents.

Lawyers had a reasonable argument for neglecting the data until the past few years. Excel gurus could manipulate numbers, use complex algorithms to find anomalies, and project trends, but little could be done with text. The available software wasn’t up to the challenge. It is dangerous to point to any one thing as driving a movement, but the need to access the exponentially growing blob of information on the Web certainly gave adrenaline to computer-based text understanding and, more recently, has helped programs aimed at manipulating legal text.

Fields such as machine learning (explained on Google’s official blog, and the approach behind IBM’s Watson winning Jeopardy! and Google’s DeepMind beating the Go champion) have had their most visible impact in the legal industry on e-discovery. As little as 10 years ago, associates were mired in stacks of paper, eyeing each document and deciding whether it was responsive to the opposing party’s discovery requests. Today, computers sift through those documents using algorithms containing predictive analytics, and parties review and produce documents without a person touching a piece of paper.

E-discovery was a shock to the legal system, but it was just the beginning. We are now developing tools that can “read” those documents and write the story. We already have software that writes sports stories from data and corporate-performance articles from annual reports. Although writing a case summary is more difficult, it may not be as much of a stretch as you think. To begin, we need data.

Filing Isn’t Data Management

Most e-discovery vendors have been around less than 20 years. They quickly moved from simple “X within five words of Y” searches to state-of-the-art text mining, but they found out along the way that the real problem is getting to the ore—finding the documents to mine for data and getting the documents ready to be mined.
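For readers who have never seen one, an “X within five words of Y” search can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, the tokenization is deliberately crude, and treating “within five words” as a position difference of at most five is an assumption, since search products define proximity in different ways.

```python
import re


def within_n_words(text, x, y, n=5):
    """Return True if the words x and y occur within n word positions of each other."""
    # Crude tokenization: lowercase runs of letters, digits, and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    x_positions = [i for i, word in enumerate(words) if word == x.lower()]
    y_positions = [i for i, word in enumerate(words) if word == y.lower()]
    return any(abs(i - j) <= n for i in x_positions for j in y_positions)


if __name__ == "__main__":
    sample = "The Supreme Court of the United States issued its opinion."
    print(within_n_words(sample, "supreme", "court"))
```

A search like this finds documents that contain the words; it says nothing about what the documents mean, which is the gap the newer text-mining tools try to close.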

Lawyers, both in law firms and law departments, have never been good about systematically organizing and storing materials, whether in paper form or on computers. They use the teenager’s room approach: everything has its place, and that place is where it lands when they drop it. For law-firm attorneys, that means that documents are stuffed in folders, which are stuffed in expandable folders, which are stuffed in file rooms. Law-department attorneys stuff everything in file cabinets or boxes sent to corporate storage. The data in those papers and in their twins sitting in the computers is like the teenager’s missing homework—somewhere in the mess but almost impossible to find.

To find where lawyers stuff things, we use document-management systems or even knowledge-management systems that point us in a certain direction. If the systems work, however, all we have found is the document. To the computer, the words on the page could be any set of abstract symbols. If this is our data, it isn’t very useful.

Find It Then Hire a Data Munger

The first challenge for most data scientists is getting data into a useable form. Lawyers, however, must first find the data before they can move to cleanup. Data scientists call the data cleanup step “data wrangling” or “data munging,” and it is a big one. Data wrangling can eat up to 80 percent of a data scientist’s time.

Think about some of the data in a law firm that is used every day: client information. Firms have systems for keeping track of client information, whereas law departments keep track of the lawyers they use. Check your system and you will find out-of-date, incomplete, duplicate, and incorrect entries. Imagine the time it would take to freeze the system and have someone clean up the data. Of course, as soon as they have finished and you resume using the data, it would be out of date.
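A small sketch shows what one slice of that cleanup looks like in practice. The records, the field names, the list of corporate suffixes, and both function names below are hypothetical; the point is only that duplicate entries have to be normalized before they can be recognized as duplicates.

```python
import re

# Hypothetical suffixes to ignore when comparing client names.
SUFFIXES = {"inc", "llc", "corp", "co", "ltd"}


def normalize_name(name):
    """Lowercase a client name and strip punctuation and common corporate suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    tokens = [token for token in cleaned.split() if token not in SUFFIXES]
    return " ".join(tokens)


def deduplicate(records):
    """Keep one record per normalized name, preferring the most recently updated."""
    best = {}
    for record in records:
        key = normalize_name(record["name"])
        # ISO-format dates (YYYY-MM-DD) compare correctly as strings.
        if key not in best or record["updated"] > best[key]["updated"]:
            best[key] = record
    return list(best.values())


if __name__ == "__main__":
    clients = [
        {"name": "Acme, Inc.", "updated": "2015-03-01"},
        {"name": "ACME Inc", "updated": "2016-01-15"},
        {"name": "Baker & Co.", "updated": "2014-07-09"},
    ]
    for record in deduplicate(clients):
        print(record)
```

Even this toy version makes judgment calls (is “Acme Co.” the same client as “Acme, Inc.”?), which is part of why wrangling consumes so much of a data scientist’s time.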

Now apply those problems to your real data—the data built into all of the documents you have created. Even if you have a knowledge-management system, your data is not ready for use. At best, you have a collection of documents with some simple cataloguing by field. Your knowledge-management system uses an unsophisticated search process to locate documents. When you find some documents, you can’t do much with them except use them as templates. You know how to find the data, but you still can’t get to it. Definitely not state of the art.

Startups in the legal industry are breaching the use barrier by applying analytics similar to what search-engine and social-media companies use. One small but growing example is computational linguistics (CL). Put simply, CL applies statistical tools to text. Through machine learning, computers can use CL tools to understand text at a far more meaningful level than “supreme w/5 court.” CL tools take us down the path of computers “reading” documents to get data. CL tools in the legal field are in the early stages, but the tool makers all face the same challenge. To teach the computers to read, they need documents in useable form. Imagine wanting to teach your children to read but having no books. This is where lawyers enter the picture. By recognizing that the data built into and being built into documents is the “gold” that will help them protect clients, lawyers take the first step. The second step is to begin transforming documents into data, and the third step is to store whatever new documents are created so their data is readily accessible.

Data Is Essential

Data is becoming an essential part of a modern law practice. Given the choice between lawyers who use heuristics and lawyers who use data, clients will opt for certainty over guesses. We have seen new companies that are building businesses on this trend. For example, Lex Machina assembles all the information it can find about patent lawsuits and provides lawyers with analytics to use in building case strategies. Kira and RAVN search due diligence rooms and extract data from documents.

New technologies will increase our ability to gather data and expand decision-making based on data. Blockchains, the technology that bankers fear will disrupt their industry, have the potential to disrupt part of the legal industry as well. A blockchain is a distributed database (whereas a bank database is centralized at the bank, a distributed database sits in many locations simultaneously). Each chain record or “block” holds data, a program, both, or some other discrete information. Blockchains harden the records against tampering through strong encryption and distribution and reduce the need for intermediaries.
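The tamper-resistance idea can be illustrated with a toy chain in Python. This is a sketch under stated assumptions, not a real blockchain: it uses SHA-256 hashing to link each record to the one before it, and it leaves out distribution across many computers, consensus, and digital signatures. The function names and the sample title records are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, data, prev_hash):
    """Hash a block's contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "data": data, "prev_hash": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def add_block(chain, data):
    """Append a new block that is linked to the current end of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append(
        {
            "index": index,
            "data": data,
            "prev_hash": prev_hash,
            "hash": block_hash(index, data, prev_hash),
        }
    )
    return chain


def is_valid(chain):
    """Recompute every hash and link; any edited record breaks the chain."""
    prev_hash = GENESIS_HASH
    for position, block in enumerate(chain):
        if block["index"] != position or block["prev_hash"] != prev_hash:
            return False
        if block["hash"] != block_hash(block["index"], block["data"], block["prev_hash"]):
            return False
        prev_hash = block["hash"]
    return True


if __name__ == "__main__":
    chain = []
    add_block(chain, "Parcel 12: title transferred from A to B")
    add_block(chain, "Parcel 12: title transferred from B to C")
    print(is_valid(chain))
    chain[0]["data"] = "Parcel 12: title transferred from A to Z"  # tampering
    print(is_valid(chain))
```

Because each block’s hash depends on the block before it, changing an old record invalidates everything that follows, and in a real blockchain the many distributed copies make such a change easy to detect.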

Clients and lawyers can create smart contracts using blockchains. A smart contract is a contract written in code a computer can execute. In theory, any contract clause can be written in code and built into a smart contract. Instead of people reading the contract document and evaluating its terms, the computers decide what to do according to the code. The smart contract code governs the transaction and sits in the blockchain, which ensures that the contract terms are not changed.

If you still have trouble seeing where smart contracts will be used, think about these examples:

  • Instead of recording property titles in books or databases at the registry of deeds, real estate lawyers record a blockchain title for each parcel. Each property transaction is recorded as the next block on the chain. The land registry becomes a distributed database, and the need for central recordkeeping goes away (or at least is reduced). Proving title is easy: your encrypted record must match what is in the blockchain.
  • Your client wants to lease a fleet of small cars for pizza-delivery services. The smart contract contains, among other things, the payment terms. If your client misses a payment, the dealer’s bank (which automatically watches for payment from your client’s bank) sends a signal to the cars disabling them. The next time your client wants to make a delivery, it finds that the cars have been electronically repossessed. If the cars are self-driving, they drive themselves back to the dealer.
  • Your company enters into a revolving credit agreement with a consortium of banks. The agreement contains various financial compliance terms. The computer reviews the quarterly financial statements (filed in XML format with the bank) and calculates the ratios using the definitions set out in the smart contract. Depending on the ratios, the lending consortium adjusts the interest rates or sends a notice to the borrower claiming a default.
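The third example, the revolving credit agreement, is the easiest to picture as code. The Python sketch below is an assumption-laden illustration rather than a real smart contract: the single leverage covenant, the thresholds, the interest rates, and all names are hypothetical, and a production version would run on a blockchain platform with the definitions negotiated by the parties.

```python
from dataclasses import dataclass


@dataclass
class QuarterlyFinancials:
    """Hypothetical inputs taken from the borrower's quarterly filing."""
    total_debt: float
    ebitda: float


# Hypothetical covenant terms that the agreement would define.
MAX_LEVERAGE = 4.0      # above this ratio, the borrower is in default
STEP_UP_LEVERAGE = 3.0  # above this ratio, the interest rate steps up
BASE_RATE = 0.05
STEP_UP_RATE = 0.06


def leverage_ratio(financials):
    """Total debt divided by EBITDA; non-positive EBITDA counts as unlimited leverage."""
    if financials.ebitda <= 0:
        return float("inf")
    return financials.total_debt / financials.ebitda


def evaluate_covenant(financials):
    """Decide what the agreement requires for this quarter's numbers."""
    ratio = leverage_ratio(financials)
    if ratio > MAX_LEVERAGE:
        return {"action": "default_notice", "rate": None}
    if ratio > STEP_UP_LEVERAGE:
        return {"action": "adjust_rate", "rate": STEP_UP_RATE}
    return {"action": "no_change", "rate": BASE_RATE}


if __name__ == "__main__":
    print(evaluate_covenant(QuarterlyFinancials(total_debt=350.0, ebitda=100.0)))
```

The lawyer’s work does not disappear here; it moves into deciding what counts as debt and EBITDA and where the thresholds sit, which are exactly the definitions the code then applies without discretion.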

You might think these examples sound far-fetched, but they aren’t. Honduras and the Republic of Georgia are rebuilding their land title registries using blockchains. Major banks are looking into blockchains for certain types of loan transactions, such as loans that serve traditionally underserved communities.

Remember those Silicon Valley companies that learned that data drives value? Blockchains will make data capture easier for banks and other users and drive new value in finance and law. Lawyers can use terms and parameters pulled from smart contracts to create models about what works and what doesn’t. If the data shows that certain terms or parameters correlate highly with problems, lawyers can update the code. All future smart contracts will use the new code to minimize the risk. One click of the icon and the smart contract embedded in clients’ computers reflects the latest thinking on risk. No more worrying about tracking every template and replacing it with a new one. Private blockchains—ones that run between the lawyer and client—guarantee that clients always use the latest version of a smart contract. Data will drive changes in behavior to mitigate risk.
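As a rough illustration of spotting terms that travel with problems, the Python sketch below computes how often contracts containing each term ended in a problem. The records, the term labels, and the function name are hypothetical, and a raw rate like this is only a starting point: it is not a statistical test and says nothing about cause.

```python
from collections import defaultdict

# Hypothetical history: the terms each contract used and whether a problem followed.
OUTCOMES = [
    {"terms": {"auto_renewal", "net_30"}, "problem": True},
    {"terms": {"auto_renewal", "net_60"}, "problem": True},
    {"terms": {"net_30"}, "problem": False},
    {"terms": {"net_60"}, "problem": False},
    {"terms": {"auto_renewal", "net_30"}, "problem": False},
]


def problem_rate_by_term(outcomes):
    """For each term, the share of contracts using it that ended in a problem."""
    counts = defaultdict(lambda: [0, 0])  # term -> [problem count, total count]
    for record in outcomes:
        for term in record["terms"]:
            counts[term][0] += int(record["problem"])
            counts[term][1] += 1
    return {term: problems / total for term, (problems, total) in counts.items()}


if __name__ == "__main__":
    for term, rate in sorted(problem_rate_by_term(OUTCOMES).items()):
        print(f"{term}: {rate:.0%}")
```

Terms with unusually high rates become candidates for a closer look and, if the pattern holds up, for revision in the next version of the contract code.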

Corporate, real estate, estate and trust, and other lawyers who do not understand blockchains will face challenges. Companies using smart contracts will still need lawyers to set the basic terms, but if a lawyer doesn’t understand the technology of the smart contract, clients will be less likely to use the lawyer to establish the contract.

Why Help the Bad Guy?

What about the cybersecurity threat? In the two scenarios I described at the beginning of this article, the lawyers are concerned about hackers getting to the data. Experts say there are two types of companies: those that have been hacked, and those that don’t think they have been hacked. If you treat what you have on your computers as data, and if it is inevitable that hackers will get into your computers, aren’t you courting disaster?

Cybersecurity experts agree that law firms and companies should assume that they will continue to be hacked. That doesn’t mean they should give up the battle to keep hackers out; rather, they should do what they can to keep the hackers at bay and ask what they should do if the hackers get in.

Just because a hacker can get into a computer doesn’t mean the hacker can access, decrypt, and assemble all the data it wants. You have a security alarm on your house, but you don’t leave all of your valuables lying on the kitchen counter. Cybersecurity is a challenge, not a bar, to keeping and accessing data.

Mine the Data Now

Lawyers have believed for over 100 years that formal legal study is necessary, but that they can learn everything else quickly. Litigators are famous for believing they can litigate an employment case in the pharmaceutical industry this week and an antitrust case in the retail industry next week. Large firms have moved beyond this by making everyone specialize (and sub-specialize), but the feeling still exists, so lawyers wait and watch. When they see a tool or idea become well established, lawyers make their move.

Lawyers might believe they can wait until everyone is deep into data and then put their toes in the water, but it doesn’t work that way. The companies in the retail, search engine, and social media industries I mentioned have compiled massive databases for more than a decade. It is unlikely anyone can match them. In fact, recognizing that the prize is data and not tools, Silicon Valley has embraced a new trend. They make available to everyone (i.e., open source) many of the sophisticated software tools they develop.

Why would they open source the tools? Because by doing so they get help from others who test and improve the tools, and the scientists who developed them are able to showcase their work, which is an important part of attracting and keeping talent. However, these companies know that without the incredible databases locked in their computers, others will not be able to use the tools to gain the same insights. The tools help, but the data is essential.


Lawyers have yet to realize how important data is becoming to the practice of law. Law firms and law departments will need help from academia, consultants, and others to understand how to keep documents as data repositories and how to employ the data-mining tools. First, however, lawyers must realize that what they create is data and that data has increasing value to their clients. Each firm and each department should build a proprietary database. They must treat the documents they create as if they contain gold, and then they must learn how to find and use that gold. Welcome to 21st-century mining.

