Guest Post: E-Discovery and The Enron E-Mail Dataset Research

21 October 2009 Internet, IT & e-Discovery Blog
Author(s): Peter Vogel


Before Dave Grant joined Gardere as the Director of e-Discovery, he was responsible for e-Discovery at Enron in the last few years before its total meltdown, managing more than 1.25 million documents. While at Enron, Dave responded to more than 100 subpoenas from various states and federal agencies. The Enron database has become a focal point of e-Discovery research. This Guest Blog about the Enron database is part of a bigger picture regarding academic research for developing efficient tools to improve e-Discovery.


I welcome Victoria VanBuren as the first Guest Blogger with her blog concerning the Enron e-mail database. Victoria runs the DISPUTING blog with Karl Bayer in Austin, and has a great knack for posting interesting blogs and finding blogs on important topics. She is also a co-founder of and an active participant in the LinkedIn Commercial and Industry Arbitration and Mediation Group. In addition to being a lawyer, Victoria is working on a degree in computer science, so I’m sure we will see Guest Blogs from her in the future.




By Victoria VanBuren  


The U.S. Supreme Court’s grant of certiorari to former Enron CEO Jeffrey Skilling dominated the news headlines last week. Interestingly, the Federal Energy Regulatory Commission (FERC), during its investigation into Enron’s involvement in the energy crisis of 2000-01, made available to the public a large database called the “Enron Corpus.” This dataset consists of about half a million e-mail communications from former Enron senior executives and energy traders.


Enron E-mail Dataset Research


Because of its size and public status, the Enron Corpus is a rare and valuable tool for experimenting with text classification methods. Since FERC posted it to the web, this dataset has been the subject of research by computer science departments of several universities, including the Massachusetts Institute of Technology and Stanford University. In the summer of 2009, the team at the TREC Legal Track, an organization co-sponsored by the U.S. Department of Defense, started conducting research on the Enron Corpus with the purpose of improving large-scale search techniques.


Our Research – Bayesian Text Classifier


In the spring of 2009, Texas State University computer science students David Villarreal, Thomas McMillen, Andrew Minnick, and I, under the supervision of computer forensic expert Wilbon Davis, used the Enron Corpus to train a Bayes-based algorithm to classify the Enron e-mails as relevant or irrelevant to a given legal issue. This type of algorithm is commonly used by e-mail spam filters.
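
For readers curious how a Bayes-based classifier of this kind works, here is a minimal sketch in Python. It is not the code from our project. It assumes a simple multinomial naive Bayes model with add-one smoothing, and the class name, the tokenizer, and the four training messages are all invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesClassifier:
    """Illustrative multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training e-mails
        self.vocabulary = set()      # every word seen during training

    def train(self, text, label):
        self.doc_counts[label] += 1
        counts = self.word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods for each word.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training messages, not taken from the Enron Corpus.
classifier = NaiveBayesClassifier()
classifier.train("schedule the california power trades before the price cap", "relevant")
classifier.train("move the energy trades to the west desk today", "relevant")
classifier.train("lunch on friday at the new place downtown", "irrelevant")
classifier.train("fantasy football picks are due on friday", "irrelevant")

print(classifier.classify("new power trades for california"))  # relevant
print(classifier.classify("football and lunch on friday"))     # irrelevant
```

The classifier scores each label by how often the words of a new e-mail appeared in training e-mails carrying that label, then picks the higher score. A spam filter does the same thing with "spam" and "not spam" in place of "relevant" and "irrelevant."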


The Results


The team hoped that this mathematical approach would achieve better accuracy than the roughly 20% found using Boolean keyword searching, a method employed by many lawyers. Surprisingly, the Bayesian filter correctly identified known relevant e-mails at average rates ranging between 43% and 66%. And as expected, the accuracy on irrelevant e-mails was even higher, with averages ranging between 44% and 77%. Texas State University published the Technical Report last week and it can be downloaded for free here.
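
To make the two kinds of accuracy figures concrete, here is a small Python sketch of one way to compute them. It assumes that "relevant accuracy" means the share of known relevant e-mails the filter labeled relevant, and likewise for irrelevant e-mails; the Technical Report's exact definitions may differ. The function name and the eight sample labels are hypothetical.

```python
def per_class_accuracy(true_labels, predicted_labels):
    """For each true label, return the fraction of e-mails classified correctly."""
    totals = {}
    correct = {}
    for truth, predicted in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == predicted:
            correct[truth] = correct.get(truth, 0) + 1
    return {label: correct.get(label, 0) / totals[label] for label in totals}


# Hypothetical labels for eight e-mails (invented, not results from the study).
truth = ["relevant"] * 4 + ["irrelevant"] * 4
predicted = ["relevant", "relevant", "irrelevant", "irrelevant",
             "irrelevant", "irrelevant", "irrelevant", "relevant"]

rates = per_class_accuracy(truth, predicted)
print(rates["relevant"])    # 0.5
print(rates["irrelevant"])  # 0.75
```

Reporting the two rates separately matters because a filter that labels everything irrelevant would score well on irrelevant e-mails while finding none of the relevant ones.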



