Chi Explores Essence of Big Data
Whether you notice it or not, you receive and create vast amounts of data in your everyday life, sometimes merely by sending messages or browsing items on a shopping site. Many fields, such as medicine and entertainment, are data-rich, which drives researchers to find new ways to capture and analyze this rapidly growing body of information.
Carnegie Mellon University’s Yuejie Chi is one of these researchers.
“There’re lots of interesting questions about how you can model such data and how you can extract information from these data,” said Chi, an associate professor of electrical and computer engineering. “They allow me to apply the type of tools I know to some practical problems that domain experts might be interested in.”
For her research, Chi earned a Presidential Early Career Award for Scientists and Engineers (PECASE). Established in 1996, the PECASE is the highest honor bestowed by the United States Government on outstanding scientists and engineers who have begun their independent research careers and have shown exceptional promise for advancing their fields.
Chi’s research focuses on representing data efficiently to reduce complexity and improve decision making.
“We can obtain plenty of information from big data, but the data we observe and collect every day can be highly redundant, messy, and incomplete,” Chi said. “Take movie sites such as Netflix as an example; the users may only review a small number of films even though there are thousands of films out there.”
How, then, can people extract useful information from such raw data? Though overwhelming at first glance, the entries in big data matrices are often correlated. A movie site may have millions of users, but they share many characteristics, such as age, country of origin and educational background. Likewise, movies can share genres, directors and lead actors. By studying how entries correlate, researchers can uncover hidden features. By focusing on these latent variables, movie sites can predict the missing entries and, in turn, which movies users might like. In this way, they can design algorithms to build an effective recommendation system.
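To make the idea of predicting missing entries from latent variables concrete, here is a minimal sketch in Python. It is not Chi’s algorithm; it uses alternating least squares, a common textbook approach to low-rank matrix completion. The function name `als_complete`, the toy `taste` and `appeal` factors, and all parameter values are assumptions made for illustration only.

```python
import numpy as np

def als_complete(ratings, observed, rank=1, reg=1e-3, iters=50, seed=0):
    """Fill in missing ratings with a low-rank model fitted by alternating least squares.

    ratings  : (users x items) array; values at unobserved positions are ignored
    observed : boolean array of the same shape, True where a rating exists
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    # Latent factors: one small vector per user and one per item (positive random start).
    U = rng.random((n_users, rank)) + 0.1
    V = rng.random((n_items, rank)) + 0.1
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        # Fix item factors; solve a small ridge regression for each user.
        for i in range(n_users):
            cols = observed[i]
            Vi = V[cols]
            U[i] = np.linalg.solve(Vi.T @ Vi + ridge, Vi.T @ ratings[i, cols])
        # Fix user factors; solve a small ridge regression for each item.
        for j in range(n_items):
            rows = observed[:, j]
            Uj = U[rows]
            V[j] = np.linalg.solve(Uj.T @ Uj + ridge, Uj.T @ ratings[rows, j])
    return U @ V.T

# Toy data (hypothetical): every rating is the product of one hidden user trait
# and one hidden movie trait, so the full table has rank one.
taste = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
appeal = np.array([1.0, 1.2, 0.8, 1.4, 0.6])
full = np.outer(taste, appeal)

# Hide four ratings, as if those users never reviewed those movies.
observed = np.ones(full.shape, dtype=bool)
observed[[0, 2, 3, 5], [1, 4, 0, 2]] = False

predicted = als_complete(np.where(observed, full, 0.0), observed)
```

In this toy setting the hidden entries of `predicted` can be compared against `full` to see how well the latent structure was recovered. Real rating data is noisy and only approximately low-rank, so a practical system would need a higher rank, tuned regularization and held-out validation.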
“You don’t directly just think about the data itself; you’re trying to get some structures,” Chi said. “Once you get a good model of the latent structure, you can think about solving an inverse problem where you try to recover those latent structures using optimization. So we’re studying how to design algorithms to recover these structures.”
Aside from recommendation systems, Chi also uses latent representations to examine problems associated with imaging modalities. Biologists build devices, such as single-molecule super-resolution microscopes, to look at structures within cells, but the images they collect often lack the desired resolution because of the devices’ limitations. By studying latent structures, Chi’s team has developed a new algorithm that significantly enhances image resolution; it uses the same available data but fewer computational resources.
Recently, Chi has been developing algorithms for distributed optimization. Nowadays, people often distribute data to different machines, as the data sets are too massive to fit onto a single device. Once they establish a distributed setting, however, communication issues may arise among individual machines. There may be adversarial events, and some entities may not want to share data with the central location for privacy reasons. Thus, Chi aims to design algorithms that are communication-efficient and resilient to outlier events.
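One simple way to picture resilience to outlier events is to look at how a central server combines updates from many machines. The sketch below is not taken from Chi’s work; it illustrates coordinate-wise median aggregation, a standard robust alternative to averaging. The function name `robust_aggregate` and the example numbers are hypothetical.

```python
import numpy as np

def robust_aggregate(worker_updates):
    """Combine per-machine updates with a coordinate-wise median.

    worker_updates : (machines x parameters) array, one row per machine.
    A minority of wildly wrong rows cannot drag the median far, unlike the mean.
    """
    return np.median(np.asarray(worker_updates, dtype=float), axis=0)

# Four well-behaved machines report similar updates; the fifth is faulty or adversarial.
updates = np.array([
    [1.0, 2.0],
    [1.1, 1.9],
    [0.9, 2.1],
    [1.0, 2.0],
    [1000.0, -1000.0],  # hypothetical corrupted update
])

median_update = robust_aggregate(updates)   # stays close to the honest machines
mean_update = updates.mean(axis=0)          # pulled far off by the single bad row
```

The median only tolerates corruption when fewer than half of the machines misbehave, and it addresses robustness rather than communication cost or privacy, which call for separate techniques.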
“Once you know how to represent your data, you can leverage the structures in your algorithm design and achieve the goal more efficiently,” Chi said.