Tidy data
Tidy data is an approach to organising tabular data so that it is easy to manipulate, model and visualise.
Tidy data is data that has been organised so that analysis software can work with it directly. Following tidy data principles makes it easier for you and others to apply the power and flexibility of computational methods to your data. Many popular analysis tools (including Python, R, and MATLAB) work best when data is arranged in a tidy way.
More information: Tidy Data (Journal of Statistical Software)
Tidy data principles
- Each variable has its own column.
- Each observation has its own row.
- Each value has its own cell.
The aim is to make sure that each cell contains a single piece of information.
For example, if you have an existing date column in the format YYYY-MM-DD, you might prefer to split it into separate year, month and day columns, which makes it easier to filter, group and transform your data by each component.
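A minimal sketch of the date split in Python, using only the standard library. The records, the column names (site, date, count) and the helper split_iso_date are hypothetical and exist purely for illustration; the original date column is kept alongside the new ones.

```python
from datetime import date


def split_iso_date(value: str) -> dict:
    """Split a YYYY-MM-DD string into year, month and day components."""
    parsed = date.fromisoformat(value)  # raises ValueError on malformed input
    return {"year": parsed.year, "month": parsed.month, "day": parsed.day}


# Hypothetical records with a single combined date column.
records = [
    {"site": "A", "date": "2021-03-15", "count": 12},
    {"site": "B", "date": "2021-04-02", "count": 7},
]

# Keep the original column and add the three components next to it.
split_records = [{**row, **split_iso_date(row["date"])} for row in records]

for row in split_records:
    print(row)
```

Parsing the string as a date, rather than cutting it at the hyphens, means malformed values raise an error instead of silently producing bad columns.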
You can put these principles into practice when collecting and recording your data, or by transforming your data inside analysis software.
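As one illustration of transforming data inside analysis software, the sketch below reshapes a "wide" table, where a variable (the year) is hidden in the column headers, into a tidy layout with one observation per row. The table contents and the helper to_long are hypothetical; dedicated tools such as pandas.melt in Python offer the same kind of reshaping.

```python
# Hypothetical "wide" table: one column per year, so the year variable
# lives in the column headers instead of in its own column.
wide_rows = [
    {"site": "A", "2020": 10, "2021": 12},
    {"site": "B", "2020": 5, "2021": 7},
]


def to_long(rows, id_column, variable_name, value_name):
    """Return one row per (identifier, former column header) pair."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row
            long_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return long_rows


tidy_rows = to_long(wide_rows, id_column="site", variable_name="year", value_name="count")

for row in tidy_rows:
    print(row)
```

After reshaping, each variable (site, year, count) has its own column and each site-year observation has its own row.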
Tips for arranging data
It is often useful to:
- Keep an unmodified, original copy of the raw data in case you are required to produce it for verification or integrity purposes.
- Keep a separate working copy, which you can modify and transform as part of your analyses. Be sure to record the steps you have taken to move from raw to tidy data if you are cleaning up an existing dataset.
- Make your variable names human-readable and understandable, but keep them concise. Avoid spaces and special characters.
- Use one spreadsheet file for each table and don't combine multiple tables in the same sheet.
- Save tabular data in a universal open format such as Comma Separated Values (.csv) so it can be read by the widest range of software.
- Create a README.txt file outlining important contextual information about your data instead of relying on formatting and notes within the spreadsheet (which make it harder to import into analysis software). Include information about the variables, their units, information about your project, and any other useful information about the choices you have made when collecting, transforming, or analysing your data. This information will come in handy when you go to publish or communicate your work.
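Two of these tips, concise variable names without spaces or special characters and saving as CSV, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the header, the rows and the helper clean_name are hypothetical, and the naming rule shown (lower case with underscores) is one reasonable convention, not a requirement.

```python
import csv
import io
import re


def clean_name(name: str) -> str:
    """Lower-case a column name and replace runs of spaces or special characters with "_"."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


# Hypothetical header and rows as they might appear in a spreadsheet.
raw_header = ["Site ID", "Sample Date", "Temp (C)"]
rows = [["A", "2021-03-15", 18.5], ["B", "2021-04-02", 16.0]]

header = [clean_name(name) for name in raw_header]

# Written to an in-memory buffer here; open("working_copy.csv", "w", newline="")
# (a hypothetical file name) could be used instead to write a file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

Because the cleaned name temp_c no longer spells out the unit in full, the README.txt file is the place to record that the column holds degrees Celsius.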
Contact
Research Data Support Services
Email: researchdata@auckland.ac.nz
eResearch Engagement Specialist
Email: Tom Saunders