A research team co-led by Cornell found that for schools without the resources to conduct learning analytics to help students succeed, modeling based on data from other institutions can work as well as local modeling, without sacrificing fairness.
“To use data-driven models, you need data,” said Rene Kizilcec, assistant professor of information science in the Cornell Ann S. Bowers College of Computing and Information Science. “And in many schools, especially lower-resourced schools that would benefit the most from learning analytics applications, data is rarely accessible.”
Kizilcec is a senior author of “Cross-Institutional Transfer Learning for Educational Models: Implications for Model Performance, Fairness, and Equity,” to be presented at the Association for Computing Machinery Conference on Fairness, Accountability, and Transparency (ACM FAccT), June 12-15 in Chicago. The lead author is Josh Gardner, a doctoral student in computer science at the University of Washington.
Kizilcec and his team used anonymized data from four U.S. universities and converted it into a common structure for the purpose of modeling which students are likely to drop out of college. Only the university-specific models – not individual student data, which raises privacy issues – were shared among members of the research team.
More than 1 million students drop out of college each year in the U.S.; they are 100 times more likely to default on their student loan payments than those who graduate. This has led the federal government to impose regulations that incentivize colleges and universities to reduce dropouts by requiring them to report dropout rates, as well as through rankings that account for graduation rates.
Kizilcec said that major institutions have the resources to conduct predictive data analytics. But the institutions that could benefit most from such analytics – smaller colleges or two-year institutions – typically don’t.
“They have to rely on the services of a few companies that offer education analytics products,” he said. “Institutions can either build their own models – a very expensive process – or purchase an analytics ‘solution,’ with modeling that is typically done externally on other institutions’ data. The question is whether these external models can perform as well as local models, and whether they introduce biases.”
The goal of the researchers’ work was to accurately predict “retention” – whether each student who enters an institution for the first time in the fall would enroll at that same institution the following fall.
To assess the success of transfer learning – taking information from one institution and using it to predict outcomes at another – the team employed three approaches (a rough code sketch follows the list):
- Direct transfer – a model from one institution is used at another;
- Voting transfer – a form of averaging to combine the results of several models (“voters”) trained at disparate institutions to predict outcomes at another; and
- Stacked transfer – combining the predictions of models trained on all available institutions with the training data of the target institution.
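To illustrate how the three strategies differ in where they draw their training signal, here is a minimal, hypothetical sketch in Python with scikit-learn. The model choice (gradient boosting), the helper names and the feature handling are assumptions made for illustration, not the team’s actual pipeline.

```python
# Sketch of the three transfer strategies, assuming each institution's anonymized
# records share a common feature schema (X) and a binary retention label (y).
# Model choice and helper names are illustrative only.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_local(X, y):
    """Local model: trained only on one institution's own data."""
    return GradientBoostingClassifier().fit(X, y)

def direct_transfer(source_model, X_target):
    """Direct transfer: apply a model trained at one institution to another."""
    return source_model.predict_proba(X_target)[:, 1]

def voting_transfer(source_models, X_target):
    """Voting transfer: average the predictions of models trained elsewhere."""
    probs = [m.predict_proba(X_target)[:, 1] for m in source_models]
    return np.mean(probs, axis=0)

def stacked_transfer(source_models, X_target_train, y_target_train, X_target_test):
    """Stacked transfer: use source-model predictions as extra features
    alongside the target institution's own (smaller) training set."""
    def augment(X):
        extra = np.column_stack([m.predict_proba(X)[:, 1] for m in source_models])
        return np.hstack([X, extra])
    meta = GradientBoostingClassifier().fit(augment(X_target_train), y_target_train)
    return meta.predict_proba(augment(X_target_test))[:, 1]
```

In this framing, direct and voting transfer require no labeled data at the receiving school, while stacked transfer still needs some local training data – the kind of trade-off the study weighs.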
The researchers compared the three transfer methods against local modeling at each of the four institutions to assess the validity of transfer learning. Predictably, local modeling did a better job of predicting dropout rates, “but not by as much as we would have thought, frankly, given how different the four institutions are in size, graduation rates and student demographics,” Kizilcec said.
And in terms of fairness – the ability to achieve equivalent predictive performance across sex and racial subgroups – the transferred models held up as well as the local ones.
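One way to make that fairness notion concrete is to compare a model’s predictive performance across subgroups. The snippet below is a simplified, hypothetical check along those lines (using AUC and a single gap statistic), not the specific fairness measures reported in the paper.

```python
# Illustrative fairness check, assuming arrays of true retention labels, model
# scores, and a subgroup attribute (e.g., sex or race) for each student, with
# both outcomes present in every subgroup. The paper's exact metrics may differ.
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auc_gap(y_true, y_score, group):
    """Return the largest difference in AUC between any two subgroups,
    plus the per-subgroup AUCs; a smaller gap means more equivalent performance."""
    aucs = {g: roc_auc_score(y_true[group == g], y_score[group == g])
            for g in np.unique(group)}
    return max(aucs.values()) - min(aucs.values()), aucs
```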
Kizilcec said his team’s results point to more equity in dropout prediction, which could help lower-resourced schools intervene earlier and prevent student departures, which cost institutions and can lead to worse outcomes for students.
“It may not be necessary after all to allocate resources to create local models at every single school,” he said. “We can use insights from schools that have data infrastructure and expertise to offer valuable analytics to schools without these resources, and without sacrificing fairness. That’s a promising result for school leaders and policymakers.”
Other contributors are Christopher Brooks, assistant professor at the University of Michigan School of Information; Renzhe Yu, assistant professor of learning analytics and educational data mining at Columbia University; and Quan Nguyen, instructor of data science at the University of British Columbia.
Support for this work came from Google and Microsoft.
By Tom Fleischman, Cornell Chronicle
This story was originally published in the Cornell Chronicle.