Last month, the Massive Data Institute (MDI) organized a two-part panel discussion on “Data Blending: Tackling the Obstacles” in the Bioethics Research Library. Data blending is the process of associating a variety of organic and observational data with more traditional administrative and survey data.
The first panel was moderated by Jonathan Ladd, associate professor in the McCourt School, and featured four presentations by Zeina Mneimneh, assistant research scientist at the University of Michigan; Quynh Nguyen, assistant professor at the University of Maryland; Josh Pasek, associate professor at the University of Michigan; and Ramanathan Guha, fellow at Google, on their data blending research. They also discussed limitations and challenges associated with different types of data merging.
- Dr. Nguyen presented on geotagged tweets as predictors of health outcomes. She blends traditional sources of data with geotagged tweets and tweets with place attributes to identify health indicators in different regions of the country. Drawing on over 32 million tweets, she discussed health outcomes for different demographic groups.
- The second speaker was Dr. Pasek, who discussed challenges to linking traditional surveys with social media and other kinds of data. The largest challenges with traditional survey data include increasing costs and declining response rates. Dr. Pasek questioned how one can use data from different sources in a way that complements traditional surveys, given that many of the new forms of data being used are not designed to answer traditional social science research questions. These new forms of data are different by nature because of how they are generated and the potential ethical considerations surrounding their use.
- Dr. Mneimneh discussed micro data blending and evaluated the merits of linking social media data with survey data, subject to consent. In her talk, she discussed the complexities associated with obtaining consent from participants for studies that would link survey data to social media data, which she noted is more readily available now due to increasingly stable rates, despite unknown measurement and data generation properties.
- The last speaker, Guha, presented on Data Commons, a platform for sharing publicly available datasets in a common way. He explained that Google would like to spearhead data and knowledge sharing using Data Commons. All the data in Data Commons is accessible at the level its owners set up, and a knowledge graph has been constructed using the shared data. Guha explained that one intended outcome is to enable students and data journalists to use these data and learn more about the world by connecting different types of data shared through this platform.
The event then shifted to examine Data Blending in the Federal Government. Robert Groves, provost at Georgetown University and former director of the U.S. Census Bureau, moderated a conversation between Jeffrey Chen, chief innovation officer at the U.S. Bureau of Economic Analysis (BEA), and Stephanie Lee Studds, chief in the Economic Indicators Division at the U.S. Census Bureau (Census).
The discussion focused on when data blending is a viable option for enhancing agency data and the various strategies agencies have used to overcome some of the challenges associated with this approach. At the BEA, Chen said, researchers use multiple or diversified signals from blended data, including inflation and employment rates, to make GDP predictions. Cultural barriers and biases are common. The BEA works to remove existing biases from data sources and to overcome cultural barriers between businesses, social scientists, and tech companies over their differing approaches to data blending.
The Census Bureau, according to Studds, uses data blending to fill in missing information within its own data with third-party data sets. Among the challenges she noted were how transparent the variable construction process is at different companies and how much reliability varies across data providers. Studds stressed the importance of interagency cooperation to improve transparency and data quality.