
(Marcio Jose Bastos Silva/Shutterstock)
Like many research institutions, Harvard University struggled to manage the rapidly growing storage environment backing its HPC clusters. The university had little visibility into which research projects were consuming storage, and the 5,000 users across the Faculty of Arts and Sciences (FAS) had little reason to care. But thanks to storage insights enabled by a Starfish data management tool, the university developed a chargeback system that enabled it to bill users for the storage they used, which helped it get off the storage treadmill and put it on a course for sustainability.
Harvard University’s FAS Research Computing organization manages HPC and general compute resources for a large swath of the university, including core sciences like astronomy, physics, and chemistry as well as engineering: basically everything at the school except the medical and business schools. The group manages one major supercomputer, Cannon, with 1,800 compute nodes and more than 1,000 GPUs, as well as a few smaller systems.
Before it started getting a handle on its storage environment, FASRC managed a mish-mash of file systems and storage servers. According to Raminder Singh, the FASRC senior director who heads the organization, it had 30 petabytes of Lustre systems and another 46PB of other file system types, including NFS and Isilon systems from Dell/EMC and others. It utilized a large number of white box storage servers, many of which were upwards of 10 years old and no longer under warranty.
FASRC struggled to add enough capacity to meet storage demands. According to Singh, who joined FASRC in 2018 and was promoted to the senior director role in June 2023, storage capacity was growing at a 20% to 30% annual rate. From 2020 to 2025, the amount of storage under management doubled, he said.

Storage at Harvard FASRC was growing at 20% to 30% per year before the chargeback system (Joe Techapanupreeda/Shutterstock)
About four years ago, Singh’s predecessor at FASRC recognized that the situation was unsustainable. A big part of the uncontrolled storage growth was that researchers had no incentive to minimize their storage use. The cost of storage was not included as a component of researchers’ grant applications, which led users to grossly over- or underestimate their storage needs. Something had to change.
The idea that the FASRC director came up with was to implement a chargeback system. By tying researchers’ actual storage consumption back to their grants, researchers would be responsible for paying for the storage that they used. Having some skin in the game would hopefully incentivize researchers to become more responsible consumers of storage resources, and storage consumption would decline.
Other research institutions have implemented chargeback systems to tie the resources consumed as part of a project back to the grants received from the National Institutes of Health (NIH) and National Science Foundation (NSF). However, the federal government has strict accounting rules for chargeback systems. Researchers must be able to justify the costs, and the work must be auditable.
“There was the idea to actually make storage sustainable,” Singh said. “How we can offer storage as a service, where we can recover the cost? And to do that, we had to come up with certain tiers of storage.”
FASRC developed a multi-tiered storage system with different performance, price, and recoverability attributes. Tier 0 is intended for active analysis research data connected to the supercomputer and costs $50 per year per TB, but comes with no snapshots or disaster recovery. Tier 1 is intended for general purpose data and comes with daily snapshots, but is relatively pricey at $250 per year per TB. Tier 2 functions as an intermediary storage repository for older research data, with no weekly snapshots, at $100 per year per TB. The cheapest form of storage is tape, at $5 per year per TB.
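To make the tier arithmetic concrete, here is a minimal sketch in Python that computes a yearly bill from the per-TB rates quoted above. The tier labels, the function name `annual_charge`, and the example lab allocation are hypothetical and are not drawn from FASRC’s actual billing system.

```python
# Annual rates in USD per TB, as quoted in the article.
# The dictionary keys are illustrative labels, not FASRC identifiers.
RATES_USD_PER_TB_YEAR = {
    "tier0": 50,
    "tier1": 250,
    "tier2": 100,
    "tape": 5,
}


def annual_charge(usage_tb):
    """Return the yearly charge in USD for a mapping of tier -> TB used."""
    total = 0.0
    for tier, terabytes in usage_tb.items():
        if tier not in RATES_USD_PER_TB_YEAR:
            raise KeyError(f"unknown tier: {tier}")
        if terabytes < 0:
            raise ValueError("usage cannot be negative")
        total += RATES_USD_PER_TB_YEAR[tier] * terabytes
    return total


# Hypothetical lab: 100 TB kept entirely on Tier 1, versus the same lab
# after moving 90 TB of cold data to tape.
before = annual_charge({"tier1": 100})
after = annual_charge({"tier1": 10, "tape": 90})
print(before, after)
```

Under these assumed numbers, the yearly bill falls from $25,000 to $2,950, which illustrates why a per-tier price signal gives a lab a reason to move cold data off the more expensive tiers.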
In addition to the multi-tiered storage, FASRC realized it needed a better way to figure out who is using the storage. How could it implement a chargeback system to track actual storage usage in such a busy research institution, with thousands of researchers across hundreds of labs creating billions of files across half-a-dozen file systems? FASRC technicians may be able to get insights for a single well-managed storage system using native tools, but getting the overarching view across the entire environment would be quite difficult.
The solution came in the form of a much beloved sea animal: the Starfish. Specifically, it was the unstructured data management tool from Starfish Storage that ultimately gave FASRC the insight it needed to get a handle on its rapidly expanding storage environment and to implement the chargeback system.
Starfish gives FASRC a detailed view into what’s being stored. Starfish’s patented index system allows FASRC to build a global inventory of all of its storage assets and to tie storage consumption back to specific users and research projects.
“One big reason for picking Starfish was it can scan different file systems,” Singh said. “You want unified information to present to your users. They cannot log into three or four different systems to get that information.”
Harvard’s researchers typically aren’t computer experts who intrinsically know how to create an efficient and high-performance storage environment. They’re undergrad, graduate, and postgrad students who are enmeshed in pursuing scientific knowledge.
“If you go ask the researchers what they need, they don’t know. That’s what working in research computing from last 10 years taught me,” Singh said. “People will ask for 100 terabytes of storage as a startup package, and 90% of the time they only use 10% of it.”
Thanks to Starfish, the researchers have a relatively simple mechanism to gain some semblance of storage efficiency: using the Starfish tagging system to mark which files can be archived.
“All they have to do is put a tag down and then the Starfish mechanism in the background does all of the operational housekeeping work to check those files into the tapes, into the archival system,” said Starfish Founder and Chief Evangelist Jacob Farmer. “And it provides the recovery mechanism to bring them back out again, and it provides the historic listing of what was there originally.”
Thanks to Starfish, FASRC was able to discover that nearly 50% of the data it was storing had not been accessed in the past two years. Armed with this information, the heads of research programs were incentivized to either delete unneeded data or to move data that still had value to the archive or its IBM tape library.
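The kind of measurement described here can be approximated on a single directory tree with standard tools. The sketch below is not Starfish’s method or API; it is a plain Python walk that sums the bytes in files whose last-access time is older than a cutoff. The function name `stale_bytes` and its parameters are hypothetical, and access times may be unreliable on file systems mounted with `noatime` or `relatime`, so the result is only an estimate.

```python
import os
import stat
import time

SECONDS_PER_DAY = 86400


def stale_bytes(root, max_age_days=730, now=None):
    """Return (stale_bytes, total_bytes) for regular files under root.

    A file counts as stale when its last-access time is older than
    max_age_days (730 days is roughly two years).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(info.st_mode):
                continue  # ignore symlinks, sockets, devices
            total += info.st_size
            if info.st_atime < cutoff:
                stale += info.st_size
    return stale, total
```

Calling `stale_bytes("/path/to/lab/share")` returns two byte counts whose ratio is the fraction of data untouched for two years. A walk like this covers only one file system at a time and becomes slow at billions of files, which is the gap a global index is meant to fill.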
“When we started charging for certain storage, people immediately went ahead and deleted a bunch of things by themselves, and that in itself actually helped with the growth side of things,” Singh said. “A lot of this [data] was from people who actually left Harvard maybe five years ago, and their data was still there. That was their first reaction because now they are paying, and they don’t want to pay for the data which doesn’t even have value.”
The chargeback system generated $750,000 in direct revenue in its first year and $1.5 million the following year, and the surge in data deletion and archiving to tape helped the group get a handle on its out-of-control storage growth. Singh is now looking to retire older, out-of-warranty storage systems holding multiple tens of petabytes. At the same time, he’s looking to shift to enterprise storage systems and away from open source systems. The revenue and savings generated by the chargeback system have given Singh credibility with the university’s financial folks to continue with the storage modernization.
“We’re trying to keep the cost under control, we’re trying to keep that balance, so that any new storage we are buying, we are able to recover costs,” Singh said.
But the success of the chargeback system had one more big impact: It gave FASRC the financial headroom to hire a storage expert to personally work with the researchers to come up with effective data storage strategies, which will drive even more storage savings in the future. In this respect, the implementation has been a win-win-win.
“The first two years was spent to have a system, have the tools, have everything available for this [chargeback system] to actually become practical. You start to see results after a couple of years,” Singh said. “We are starting the next phase of our storage project, where we will be consolidating some of the offerings based on the learnings we have from running the storage service center for the last two to three years.”
It was Singh’s predecessor who brought in Starfish nearly four years ago to pave the way for the chargeback system and the new storage tiers. But Singh has done much of the work of implementing the chargeback system, including the development of a new system based on the open source ColdFront software out of the University at Buffalo to automate many of the chargeback tasks.
According to Farmer, there’s an old saying in research computing to the effect that “there’s money for storage, but there’s never any money for storage management.” Perhaps it’s time to revisit that phrase in light of FASRC’s experience with Starfish?
“Storage management includes all kinds of things, like curating your data and sharing your data and charging it back,” Farmer said. “We’ve kind of always thrown the idea out there that if you bring in Starfish to help you with the operational side of saving money, like deleting and archiving and housekeeping, that that will get you the ROI that pays for everything else that you might want for data management.”