A diversity of data strengthens the versatile algorithm

Computer algorithms are being taught to interpret a variety of documents ranging from engineering journals to printed logs found in Iraqi police stations. But the quality of these computational techniques is limited by the amount and diversity of data used to train them, says Daniel Lopresti, department chair of computer science and engineering.

“Machine learning is based on the assumption that we can develop algorithms that approach a human level of understanding,” Lopresti says. “But because data sets are limited, we spend a lot of time focused on a tiny portion of the problem space.”

Enormous amounts of data are needed to train and test document recognition algorithms, yet many researchers rely on well-worn data sets, such as a collection of IEEE journals scanned in the early 1990s, or handwritten exemplars scribed by students.

That may change thanks to the tens of thousands of documents that Lopresti and his students have rescued from the former Bethlehem Steel labs and offices on the Mountaintop Campus that were recently acquired by Lehigh. This historic corpus includes routine reports, such as a binder that chronicles all the materials used in the blast furnaces for decades, as well as thousands of engineering drawings—“everything from drawings of tiny parts to floor plans for entire facilities,” says Lopresti.

“Some of the Bethlehem Steel offices were pristine,” says Barri Bruno, a senior who inherited the task of organizing and starting to scan more than 35 large filing boxes of papers and hundreds of tubes of drawings.

“Other offices looked like the apocalypse had just happened. Personal papers and coffee cups were left lying around as if people had left on a Friday and never returned.”

Working with Lehigh librarians, Lopresti’s team had just weeks to plan and mere days to capture and cart away a small portion of the documents abandoned in the labs. The scenario resembled a research project that Lopresti recently conducted for the Defense Advanced Research Projects Agency (DARPA), in which researchers used documents rescued from police stations in Iraq to improve the machine recognition of Arabic handwriting.

To date, Bruno has scanned more than 30,000 Bethlehem Steel documents, including handwritten logs, computer printouts, letters and notes. She estimates she is less than 20 percent of the way through the materials.

When researchers repeatedly develop algorithms on the same familiar, homogeneous data, the algorithms can become tuned to that data and perform poorly on the more varied documents found in the real world. Lehigh’s Bethlehem Steel collection, which Lopresti hopes can be released to researchers in the near future, offers something for researchers working on diverse problems.

“Printed documents may be interesting to one researcher, and handwritten documents to another,” Bruno says. “Because this is such a large collection and it is not synthetic”—it wasn’t created by researchers or scanned from a single source—“it becomes much more useful.”

In addition to making high-resolution scans, Bruno has been developing what researchers call the “ground truth” for the collection. She is painstakingly creating an inventory of the documents, noting which portions of each page are printed text, tables, handwritten notes, photographs or other content, so that researchers will have a benchmark to test their algorithms against. She described the process in a paper presented at the 21st Document Recognition and Retrieval Conference in San Francisco in February.
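To make the idea of a ground-truth benchmark concrete, here is a minimal sketch in Python. It is not the format Bruno's inventory uses, which the article does not describe. It assumes each page region is recorded as a labeled rectangle, and it assumes a common scoring convention in which a predicted region counts as correct when it has the same label as a ground-truth region and overlaps it by at least half (intersection over union). The names `Zone`, `iou` and `score_page` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A labeled rectangle on a scanned page (hypothetical record format)."""
    label: str  # e.g. "printed_text", "table", "handwriting", "photograph"
    x0: int
    y0: int
    x1: int
    y1: int

    def area(self) -> int:
        # Degenerate or inverted rectangles have zero area.
        return max(0, self.x1 - self.x0) * max(0, self.y1 - self.y0)


def iou(a: Zone, b: Zone) -> float:
    """Intersection over union of two rectangles, ignoring their labels."""
    inter = Zone("overlap", max(a.x0, b.x0), max(a.y0, b.y0),
                 min(a.x1, b.x1), min(a.y1, b.y1)).area()
    union = a.area() + b.area() - inter
    return inter / union if union else 0.0


def score_page(truth, predicted, threshold=0.5):
    """Return (precision, recall) of predicted zones against ground truth.

    Each ground-truth zone can be matched by at most one prediction.
    """
    unmatched = list(truth)
    hits = 0
    for p in predicted:
        same_label = [t for t in unmatched if t.label == p.label]
        best = max(same_label, key=lambda t: iou(t, p), default=None)
        if best is not None and iou(best, p) >= threshold:
            hits += 1
            unmatched.remove(best)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    truth = [
        Zone("printed_text", 0, 0, 100, 50),
        Zone("table", 0, 60, 100, 120),
        Zone("handwriting", 10, 130, 60, 150),
    ]
    predicted = [
        Zone("printed_text", 0, 0, 100, 48),
        Zone("table", 0, 62, 100, 120),
        Zone("photograph", 10, 130, 60, 150),  # right place, wrong label
    ]
    print(score_page(truth, predicted))
```

In this toy page, the algorithm finds the printed text and the table but mistakes a handwritten note for a photograph, so two of its three predictions are correct and two of the three true regions are found. Without a hand-checked inventory like the one Bruno is building, there would be nothing to compare the predictions against.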

“We have a long way to go,” Bruno says. The documents must be checked for personal information, such as Social Security numbers, and Lopresti is working with attorneys for Lehigh and ArcelorMittal, which owns the assets of the former Bethlehem Steel, to resolve copyright concerns. If researchers cannot share the documents they work with so that other academics can verify their results, then the documents are not useful for developing algorithms, Lopresti says.

But if all goes well, scientists designing new ways for computers to interpret printed and handwritten documents will soon have a treasure trove of data left for them by workers in the offices and labs at Bethlehem Steel.