Data science needed to reveal the secrets of big, complex data

26 May 2017
Professor Shaun Hendy

One of the biggest barriers to greater data use in business is not that data is big, but that it is complex.

You might have multiple customer records that you suspect point to the same person, but because they have changed address or phone number, or someone simply misspelled their name, you can’t be sure you are not dealing with multiple people.
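One way to picture the duplicate-record problem is a small Python sketch. It is only an illustration, not a description of any particular company's method: the records are invented, and the function names and the 0.8 threshold are assumptions chosen for the example. It flags two records as a possible match when the names are similar and the email address is identical.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records, invented purely for illustration.
records = [
    {"id": 1, "name": "Katherine Smith", "email": "k.smith@example.com"},
    {"id": 2, "name": "Katherin Smyth", "email": "k.smith@example.com"},
    {"id": 3, "name": "John Tane", "email": "john.tane@example.com"},
]

def name_similarity(a, b):
    """Return a score between 0 and 1 for how alike two names are."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def likely_duplicates(records, threshold=0.8):
    """Return (id, id, score) for pairs that may be the same person.

    Assumption: a pair is suspicious when the names are similar AND
    the email addresses match. The threshold is an arbitrary choice.
    """
    pairs = []
    for a, b in combinations(records, 2):
        score = name_similarity(a["name"], b["name"])
        same_email = a["email"].lower() == b["email"].lower()
        if score >= threshold and same_email:
            pairs.append((a["id"], b["id"], round(score, 2)))
    return pairs

for id_a, id_b, score in likely_duplicates(records):
    print(f"Records {id_a} and {id_b} may be the same person (score {score})")
```

A flagged pair is a candidate for a person to review, not proof of a match. Real customer data usually needs more fields and more care than this.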

Maybe you want to match data from different sources. Maybe parts of your data set were collected in different ways. Your big data may be a big mess.

Welcome to the world of complex data, a world where good data scientists are worth their weight in gold.

Working with a very large patent data set a few years ago, my research group spotted one of the most innovative regions in the world, smack bang in the middle of the State of New York – far from the Big Apple. Hmmm. It turned out that the patents were from New York, but not from the county we’d found. It seems the geocoding had failed and assigned the patents to the geographical centre of the state.

Scientists encounter this type of error all the time. Climate scientists have to patch together millions of temperature measurements taken by different instruments and at different locations from all round the world. Their data is big, complicated, and about as messy as it gets.

They cope by approaching their problem from different directions. Using a variety of methods, they measure the temperature at sea level and then high in the atmosphere to check that what they see in one data set is consistent with the other.

This can be a useful approach to data in business as well. Big data is often touted as a free pass to avoid the scientific approach of hypothesis testing. But if your data is complex, testing hypotheses against it can be a useful way to clean it up.

Data scientists put the science in data science by developing models or looking for patterns, and checking for outliers that don’t fit these.
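As a rough illustration of checking for outliers, the Python sketch below looks for a location that attracts a suspiciously large share of records, which is what a fallback coordinate from a failed geocoder would look like. It is not the method my group used. The coordinates are invented placeholders, and the function name and the 20 per cent cut-off are assumptions made for the example.

```python
from collections import Counter

def suspicious_coordinates(points, max_share=0.2, decimals=4):
    """Return coordinates that account for more than max_share of records.

    Assumption: when many records share one exact point, that point may
    be a default value rather than a real location. The 0.2 cut-off is
    arbitrary and would need tuning for real data.
    """
    rounded = [(round(lat, decimals), round(lon, decimals)) for lat, lon in points]
    counts = Counter(rounded)
    total = len(rounded)
    return {coord: n for coord, n in counts.items() if n / total > max_share}

# Invented placeholder coordinates, not real patent data.
points = [(42.9, -75.5)] * 6 + [
    (40.71, -74.01), (42.65, -73.75), (43.16, -77.61),
    (42.89, -78.88), (43.05, -76.15), (41.03, -73.76),
]

for coord, n in suspicious_coordinates(points).items():
    print(f"{n} of {len(points)} records share the location {coord}")
```

A flagged location still needs a human check. A busy research centre would look the same to this test as a geocoding fault.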

In our case, my team and I wondered at first whether there was a large research centre in the middle of New York. A quick Google search showed that the patents were coming from a small commuter town, not a major industrial laboratory.

It is not always this easy.

But this is where the value in working with complex data lies. A data set that is hard to master could be your competitive advantage.

Another way to clean up complex data is to make at least some of it open. We work with the OECD’s patent data set, which they release to research groups around the world. Whenever we find a mistake, we let them know, and we benefit from other researchers doing the same. Our competitive advantage lies in the mathematics that we use to interrogate the data, not the data itself.

Data is often complex, whether big or little, but this is a big opportunity for those with the right skills. 


Professor Shaun Hendy is a member of the Department of Physics, part of the University of Auckland's Faculty of Science.

Reproduced with permission from the National Business Review (NBR): “Data science needed to reveal the secrets of big, complex data”, published Friday 26 May 2017.