Data classification: User perspectives

Corporate information resources regularly stretch into the terabytes (thousands of gigabytes) with many thousands (even millions) of individual files to contend with -- but the problem is now far more complex than simply "finding files". Finding specific files among this confusing hodgepodge of information is inefficient and frequently incomplete. Companies often fail to recognize the importance of their data and its impact on everyday business operations. The process of "data classification" attempts to fill this void by helping businesses understand what data is actually available, its location in the enterprise, how that data is being accessed and how it must be protected to meet legal and regulatory requirements.

Corporations face a glut of data. It's increasingly difficult to know what data is available, what data is critical to the business and how best to manage overall data assets. Data classification is a critical first step on the road to information lifecycle management, allowing businesses to better understand the information that they actually have, then organize and store that information to suit business needs. The best way to gauge the value of data classification is to see how other organizations are embracing the practice, and learn how they're using some of the available software tools to help streamline adoption.

Position for compliance and eliminate junk

It's important to meet government regulations for data protection and retention, and make the most of available storage space within the enterprise. According to Lara Helms, IT process analyst for information technology at a leading defense contractor, these were exactly the two issues that weighed most heavily on IT. "Our drives were always full, but it was dead data (and junk data)," Helms says. "We also have a lot of regulatory compliance issues around FAA [Federal Aviation Administration], DoD [Department of Defense] and FCC [Federal Communications Commission], so we're required to keep data for so long. But we also have retention policies that say we can't keep data past a certain date." The problem, Helms says, was that data had to be located and managed on a manual basis. With over 60 terabytes (TB) of content in its data center, including structured and unstructured data, manual processes presented a great deal of trouble. "When our drives filled up, we'd send an e-mail to our users saying 'clean up your data,' " she says. However, most users would simply move files around rather than delete them.

"We wanted a way to automate the process of cleaning that data and taking care of retention times," says Helms, who uses Arkivio software to evaluate content on the network. The results were astonishing -- a full 80% of the company's unstructured or Windows data had not been touched in over a year. The tool also identified unwanted/illegal content, such as Outlook .PST files, .JPG images and .MP3 files. "We were able to identify gig[abyte] upon gig of data that should not have been out there." Helms also notes that the software helped to simplify the search for content. "It was able to auto-identify and auto-migrate files for us," she says. This type of control also allows Helms to allot costs more appropriately. "I can charge back engineering for the space that engineering is using, versus just a blanket allocation. It's given our organization the ability to look and see who's using what disk space, and pay for what they use."

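The kind of scan Helms describes can be illustrated with a minimal sketch. It is not how Arkivio works internally; it only shows the general idea of sorting files into categories by age and type. The one-year threshold, the extension list, the category names and the use of last-modified time as a stand-in for "touched" are all assumptions made for illustration.

```python
import os
import time
from collections import defaultdict

# Assumed policy values, for illustration only.
STALE_AFTER_DAYS = 365
UNWANTED_EXTENSIONS = {".pst", ".jpg", ".mp3"}


def classify(path, now=None):
    """Return 'unwanted', 'stale' or 'active' for a single file."""
    now = time.time() if now is None else now
    extension = os.path.splitext(path)[1].lower()
    if extension in UNWANTED_EXTENSIONS:
        return "unwanted"
    # Assumption: "not touched" is approximated by last-modified time.
    age_days = (now - os.stat(path).st_mtime) / 86400
    return "stale" if age_days > STALE_AFTER_DAYS else "active"


def summarize(root, now=None):
    """Walk a directory tree and total the bytes in each category."""
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                totals[classify(full_path, now)] += os.path.getsize(full_path)
            except OSError:
                continue  # Skip files that vanish or cannot be read.
    return dict(totals)
```

Running summarize() separately on each department's share would give per-department totals, which is roughly the information a chargeback model like the one Helms mentions would need.
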
While Helms was pleased with Arkivio's offering, the data classification process did present its share of difficulties. "We did not do a good job with organizing our data to begin with," she says. Determining retention policies and realigning data for classification purposes proved to be particularly challenging. "It takes a lot of legwork to get out there and understand what data lives where." Helms recommends patience when planning any data classification project, and suggests approaching large projects in a piecemeal fashion. Classification groups should be as generic as possible to facilitate reuse across other company departments.

Tools can dramatically enhance storage planning

Even large and influential corporations can take false steps with data classification. Matt Decker, manager of IT services at Honeywell FM&T, echoes a familiar need for compliance and storage management, but reflects on some faulty decisions in the early stages of data classification. "We started off on the wrong foot: We tried to go toward an hierarchical storage management solution, but that's just moving junk from one pile to another," Decker says. "Finding a tool that we could use to understand what that data profile is got us back on track." Before classification, Decker says that storage was simply "feeding the beast" -- neither managing for storage growth, nor understanding the data's value. Compliance needs must also be met, though that was not initially the most significant factor for Decker.

Today, Decker is using Arkivio software to manage the 5 TB of unstructured data currently in his domain. As with other users, Decker was astonished by the evaluation of data and its distribution through the enterprise. "To find a technology that enables you to do that is a very eye-opening experience," Decker says. "You think you know what you have, but until you actually take the blinds down and look out there, it's a mystery -- you 'will' be surprised." As one example, Decker realized that users' Recycle Bins demanded a significant amount of storage space. By adding a policy that categorized Recycle Bin data, it became a simple matter to move that data off of primary storage -- or eliminate it outright. "That brought back more than 100 GB in one day." While that may not sound like a tremendous amount of space overall, it is a significant percentage when compared against several terabytes. Decker appreciates the flexibility to see, categorize, implement policies and migrate data. Such decisions would have been impossible without a classification scheme in place -- and a corresponding tool to automate important tasks.
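
A Recycle Bin policy of the sort Decker describes can be sketched as a simple path rule. This is a hypothetical illustration, not Arkivio's policy format: the function names and the shape of the inventory (pairs of path and size in bytes) are assumptions. The folder names are the ones Windows has used for recycled files on different versions and file systems.

```python
from pathlib import PureWindowsPath

# Folder names Windows uses for recycled files, compared case-insensitively.
RECYCLE_FOLDER_NAMES = {"$recycle.bin", "recycler", "recycled"}


def is_recycle_bin_path(path):
    """True if any component of the path is a Recycle Bin folder."""
    parts = PureWindowsPath(path).parts
    return any(part.lower() in RECYCLE_FOLDER_NAMES for part in parts)


def plan_actions(inventory):
    """Split (path, size_bytes) pairs into candidates and keepers.

    Returns the plan plus the number of bytes that would be freed on
    primary storage if every candidate were migrated or deleted.
    """
    plan = {"migrate_or_delete": [], "keep": []}
    reclaimable_bytes = 0
    for path, size_bytes in inventory:
        if is_recycle_bin_path(path):
            plan["migrate_or_delete"].append(path)
            reclaimable_bytes += size_bytes
        else:
            plan["keep"].append(path)
    return plan, reclaimable_bytes
```

Keeping the rule separate from the action means the same inventory can first be reported on, then acted upon once the reclaimable total has been reviewed.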

Although current classification tools (like Arkivio's) have been a benefit to Decker's organization, he does cite some areas for potential improvement. "One area that pops to mind is more granular reporting capabilities," he says. The reporting pie chart is informative, but many files are grouped into the "other" category, sometimes obscuring detail regarding less popular file types. In the future, Decker would like to see other vendors appear and push data classification to the point where it becomes second nature for all types of organizations. Ultimately, Decker suggests that readers learn from his mistakes. "Don't lock in with the preconceived notion that you know what you're going to do with this data until you understand what the data is. Find a tool that will help you classify data, and then make the smart decisions based on that."

Plan ahead by knowing where you are

Many companies embrace data classification initiatives as a means of mitigating legal risk and meeting government compliance demands. But it's not all doom and gloom: many other companies experience a more fundamental need to simply know what data is available and where it's located in their storage architecture. VMware Inc., an EMC company and leading developer of computing virtualization software, is one example where significant data growth demanded a better understanding of storage needs and utilization. Russell Heredia, technical operations engineer at VMware, points out that it's no longer appropriate for product source code to coexist alongside family snapshots. "We're moving to a more distributed file server infrastructure and we need to keep tabs on what people are storing on the various file servers," he says. "You can't teach an old dog new tricks, so we need to 'enforce' policy." Heredia uses data classification tools from StoredIQ Corp. (an EMC Corp. partner) to understand and track the company's storage utilization. It's also a starting point for business-related issues (like ensuring that source code remains restricted to certain storage resources).

Heredia found the move to data classification to be surprisingly straightforward -- perhaps because VMware took the time to establish practical classifications in advance. "It didn't take very long," he says. "I think we had a pretty good working idea of how we wanted to divide the data." Once implemented, data classification allows Heredia to see data regardless of its location, and identify the data based on its contents (rather than file metadata). The move to data classification has also given VMware better insights into its storage utilization, allowing for more comprehensive storage planning in the future. Eventually, Heredia plans to implement some of the customized reporting features found in StoredIQ's product. According to Heredia, there are some downsides to the technology. "I'm not a big fan of the expert systems model that suggests that you should do 'x' or 'y' based on the data." He would prefer to retain manual control over the disposition of files. Beyond that, it can take a very long time to gather initial data sets. However, he is quick to emphasize that the advantages of data classification have far outweighed the limitations. Given the relationship between VMware, EMC and StoredIQ, it's unclear why VMware opted for StoredIQ (when EMC offers data classification products), or if EMC played any role in VMware's choice of StoredIQ. However, Heredia could not be reached for further comment.

Take a critical look at the tools available

Perhaps the only thing more difficult than defining classifications is the choice of a specific software tool. A software tool can make or break a classification effort, so it's particularly important to evaluate offerings carefully before choosing a tool for your own organization. Michael Masterson, information systems architect at a Fortune 500 life science company, clearly recognizes the need for data classification in order to address numerous regulatory obligations. "It wasn't an IT decision," he says. "We're a regulated life science company. Many of our computer systems are validated to the standards necessary in a regulated industry, so we're audited by the FDA [Food and Drug Administration] and other agencies. The Sarbanes-Oxley Act certainly plays a role."

Masterson's organization plans a move to data classification in 2006, but he has started an evaluation of the available software products now. Unfortunately, he's been disappointed by the lack of decision-making capabilities in tools he's evaluated thus far. "I've been underwhelmed with the selection of tools because they're suited best for creating and enforcing multi-tier storage architectures," he says. "That entire approach addresses the mass of data, and it addresses the age of data, but it does not address the value of data relative to the business." He is also critical of software tools when it comes to multi-tier storage management. To date, Masterson sees the most potential in Abrevity tools for his particular application. The issue to consider here is that IT departments must make a coherent effort to test and evaluate data classification software, and gauge the tool's behavior against the needs of the company before committing to a single product.

This article originally appeared on
