2020-10-13 · Alex

Twin Cities - A day in the life of a data scientist

You’ve probably heard of Twin Cities, our fabulous new data product, which provides (synthetic) unit-level records of households, families, and persons across the whole of Australia. To create Twin Cities, we combine publicly available data such as national address lists and ABS Census DataPacks. This allows us to dig into where Melbourne’s workforce lives, or even look at how children’s journeys to school create congestion and high-risk zones.

But you might be thinking:

“If the dataset is synthetic, how do I trust it? How do I know that it’s representative of the area I’m interested in?”

To answer this, I’m going to give you a quick intro into the Symbolix quality assurance process.

Twin Cities - quality assurance

Data science is all about testing: visual tests that data aligns with your expectations, tests of code logic, statistical tests ensuring your distributions are correct, and business rules and general sense checks. Like all our analysis and data products, Twin Cities goes through a thorough set of tests before we release it.

At Symbolix, we like our tests to be reproducible. Instead of a set of ad hoc tests, which we might forget in six months’ time, we write all our tests in reproducible documents which can be run at the push of a button. This means that every time we update Twin Cities, it goes through the exact same set of quality assurance tests. And the list of tests is huge and ever-growing…

(We’re sorry to the people we told about Twin Cities a year ago… we had to do a LOT of tests and improvements before releasing.)

Testing examples

Here are a couple of examples of tests in the context of Twin Cities. Obviously I can’t list every single test we do (that would take all day), but hopefully this illustrates our process.

Pure logic

The main data source feeding Twin Cities is the Census. The Census is rich in information, but it is a complex dataset. Many variables relate to and restrict one another, so we need to be careful that the internal logic and definitions of Census datasets are respected. For example, we have checks ensuring that different “count of person” variables align - e.g. if a household has four persons, two of whom are parents, there must be at least one child. This also automatically places restrictions on the age differences between the child and parents.

By making sure that no records fail basic logic checks, e.g. “everyone classed as a dependent must have an adult in their family” or “the sum of children plus adults in a family must not be greater than the household size”, we ensure that Twin Cities is internally consistent.
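Here is a minimal sketch of what checks like these could look like in Python. The record layout, the field names and the adult age threshold of 18 are assumptions made for illustration, not the actual Twin Cities schema or test suite.

```python
from dataclasses import dataclass


@dataclass
class Person:
    family_id: int      # hypothetical field: which family the person belongs to
    age: int
    is_dependent: bool


@dataclass
class Household:
    household_size: int  # hypothetical field: stated number of residents
    people: list


def failed_checks(household, adult_age=18):
    """Return a description of every logic check the household fails."""
    failures = []

    # Group the household's people into families.
    families = {}
    for person in household.people:
        families.setdefault(person.family_id, []).append(person)

    for family_id, members in families.items():
        # Children plus adults in a family must not exceed the household size.
        if len(members) > household.household_size:
            failures.append(f"family {family_id}: larger than household size")

        # Everyone classed as a dependent must have an adult in their family.
        has_dependent = any(m.is_dependent for m in members)
        has_adult = any(
            m.age >= adult_age and not m.is_dependent for m in members
        )
        if has_dependent and not has_adult:
            failures.append(f"family {family_id}: dependent without an adult")

    return failures
```

Returning a list of failures, rather than stopping at the first one, makes it easy to summarise every problem record in a single report.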

Mapping

Visualising spatial data on a map is also a great way to QA our dataset. For example, when testing that we’ve matched the address list data with the ABS data correctly, we of course check that we’ve got the right count of dwellings in each area. Another good way to check that our data filters are doing what they’re supposed to is to actually map them and check that dwellings we’re familiar with are in the right spot. Our Twin Cities Shiny app (below: example map of dwellings) allows us to see the location of individual dwellings, and provides insight into how we should modify our filtering if something doesn’t seem right.
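The dwelling count comparison mentioned above can be sketched in a few lines of Python. The function name, the area codes and the input shapes are hypothetical, chosen only to show the idea.

```python
from collections import Counter


def count_mismatches(dwelling_areas, reference_counts):
    """Compare dwellings per area against reference counts.

    dwelling_areas: one area code per synthetic dwelling (assumed input shape).
    reference_counts: mapping of area code to the expected dwelling count.
    Returns {area: (synthetic_count, reference_count)} for areas that differ.
    """
    synthetic_counts = Counter(dwelling_areas)
    all_areas = set(synthetic_counts) | set(reference_counts)
    return {
        area: (synthetic_counts.get(area, 0), reference_counts.get(area, 0))
        for area in sorted(all_areas)
        if synthetic_counts.get(area, 0) != reference_counts.get(area, 0)
    }
```

An empty result means every area matches. Anything else is a short list of places worth looking at on the map.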

Visual checks

It can sometimes be easier to spot patterns or issues using visuals rather than just numbers. Visualising data through plots and maps is one of the best ways to see whether your process is working well or needs fixing.

“One-to-one” plots like the below chart are super handy for checking our synthetic set returns the same statistics as ABS data does (check it out yourself).

This is a type of visual test. For example, if the ABS says there are 907 flats in the SA3 of Sunbury, then we want Twin Cities to also say there are 907 flats in Sunbury. If everything goes swimmingly, all the data points on this plot should lie along the 1-1 line (marked by the grey diagonal). We produce charts like this for every variable we add to Twin Cities.
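The same idea can be backed up with a numeric check that flags any point sitting off the 1-1 line. This is a rough sketch under stated assumptions: the function name, the `(area, dwelling type)` keys and the `tolerance` parameter are all hypothetical, and only the figure of 907 flats in Sunbury comes from the example above.

```python
def off_the_line(abs_counts, synthetic_counts, tolerance=0):
    """Return the categories whose synthetic count differs from the ABS count.

    Both inputs map a key such as (area, dwelling type) to a count.
    A key missing from the synthetic data is reported with a value of None.
    """
    problems = {}
    for key, abs_value in abs_counts.items():
        synthetic_value = synthetic_counts.get(key)
        if synthetic_value is None or abs(synthetic_value - abs_value) > tolerance:
            problems[key] = (abs_value, synthetic_value)
    return problems


# 907 flats in Sunbury is the example from the text; the synthetic values
# below are made up purely to show a passing and a failing case.
abs_counts = {("Sunbury", "flat"): 907}
print(off_the_line(abs_counts, {("Sunbury", "flat"): 907}))  # on the line
print(off_the_line(abs_counts, {("Sunbury", "flat"): 880}))  # off the line
```

A check like this complements the chart rather than replacing it: the plot shows the overall pattern at a glance, while the list pinpoints exactly which areas and variables to investigate.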

I think we’re doing pretty well here :)

Want to find out more?

Hopefully this has given you some insight into our quality control processes.

If you’d like to know more about Twin Cities or would like to use it in your own organisation, please contact us!
