It is important to understand and apply confidentiality principles, rules, and methods to make sure that you:
Using statistical methods correctly protects the confidentiality of data. Methods such as perturbation, aggregation, suppression, limiting access, and building synthetic or confidential unit record files keep data confidential. When data is confidential, no individuals, households, or businesses can be identified, and no unauthorised people can access the data.
Different organisations have different requirements relating to when they must or wish to protect the privacy, security, and confidentiality of data so that people, households, and organisations can’t be identified without their permission. This includes where we must or wish to protect the confidentiality of data throughout its life cycle — whenever we collect, use, store, and distribute it.
The terms privacy, security, and confidentiality are often used interchangeably, but each term has a different meaning:
What do statisticians and data analysts mean when they talk about confidentiality? How does identifiable data differ from de-identified or confidentialised information? Data identifiability is not binary. Data lies on a spectrum with multiple shades of identifiability. This is a primer on how to distinguish different categories of data in the NZ context.
Identifiable: Data that directly or indirectly identifies an individual or business.
Data that identifies a person without additional information, or by linking to information in the public domain. This includes cases where an individual can be identified by connecting pieces of information together.
Personal, identifiable data like this are protected, and should only be released to the public providing we have explicit permission to do so.
For example: name, date of birth, gender.
Gender: Female.
DOB: 31/01/1983.
Address: 28 My Road
Name: Puzzles.
Type: Paper Stationery.
Employees: 34.
Expenditure: $398,000.
De-identified: Data which has had information removed from it to reduce risk of spontaneous recognition (likelihood of identifying a person, place or organisation without any effort).
For example: Data held within Stats NZ’s Integrated Data Infrastructure (IDI) and Longitudinal Business Database (LBD) is de-identified before approved researchers can access it in a secure data lab environment.
Partially confidentialised: Data which has been modified to protect the confidentiality of respondents while also maintaining the integrity of data. Modification involves applying methods such as top-coding, data swapping, and collapsing categorical variables to the unit records.
Name: Unknown.
Gender: Female.
Address: Postcode 6012
Name: Unknown.
Type: Manufacturing.
Employees: 30-40.
Expenditure: $398,000.
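As a minimal sketch of the unit-record modifications described above, the Python below collapses an exact employee count into a band and top-codes large values. The function names, the band width of 10, and the cap of 100 are illustrative assumptions, not Stats NZ rules.

```python
def band(value: int, width: int = 10) -> str:
    """Collapse an exact value into a band such as '30-40' (width is an assumed parameter)."""
    lower = (value // width) * width
    return f"{lower}-{lower + width}"


def top_code(value: int, cap: int = 100) -> str:
    """Report any value at or above an assumed cap as '<cap>+' instead of the exact figure."""
    return f"{cap}+" if value >= cap else str(value)


# Hypothetical business record, based on the identifiable example above.
record = {"name": "Puzzles", "type": "Paper Stationery", "employees": 34}

partially_confidentialised = {
    "name": "Unknown",                       # direct identifier removed
    "type": "Manufacturing",                 # detailed category collapsed to a broader one
    "employees": band(record["employees"]),  # exact count replaced by a band
}
```

The banded value matches the "Employees: 30-40" entry in the example record. A real rule set would choose the band width and cap from the distribution of the data.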
Confidentialised: Data which has had statistical methods applied to it to protect against disclosing unauthorised information.
Statistical methods include suppression, aggregation, perturbation, data swapping, top and bottom coding, etc. These prevent the unauthorised identification of individuals, households, or organisations. This data is publicly available.
For example: Stats NZ nz.stat datasets.
Name: Unknown.
Gender: Female.
DOB: 30-40 years.
Address: Wellington.
Name: Unknown.
Type: Manufacturing.
Employees: 10-100.
Expenditure: Under $500,000.
New Zealand businesses, institutions, and organisations rely on high-quality, timely, and accurate data for planning, research, and information. Good data helps New Zealand grow and prosper.
The New Zealand Data and Information Management Principles mandate that government data and information should be open, readily available, well managed, reasonably priced and reusable unless there are necessary reasons for its protection. These principles include:
“Open: Data and information held by government should be open for public access unless grounds for refusal or limitations exist under the Official Information Act or other government policy. In such cases they should be protected.
“Protected: Personal, confidential and classified data and information are protected.”
Much of the data collected in New Zealand is about individual people, households, businesses, and organisations — including sensitive personal and commercial data. Data gatherers and users depend on the personal and commercial trust and goodwill of the people they collect data from. Maintaining confidentiality is crucial to the New Zealand data system.
You’re often required by law to keep data confidential. If you provide data to an unauthorised user, or provide identifiable information without consent, you may be breaking the law. If the information becomes public, the implications are more serious.
Ways of keeping data confidential are governed by principles, laws, and ethics.
Principles and legislative requirements underpin the policies, standards, and guidelines for data confidentiality. For example, Stats NZ’s microdata output guide describes the methods and rules researchers must use to confidentialise output produced from Stats NZ’s microdata. The methods and rules are based on legislative requirements and four principles:
Other sets of principles that are relevant to data confidentiality include:
Data users must comply with relevant legislation. Legislation with specific requirements about keeping data confidential includes:
You may also need to comply with other legislative requirements when using specific types of data. For example, the Tax Administration Act 1994 sets out requirements for protecting tax data and the Health Information Privacy Code 1994 sets out rules for collecting, managing and using health information.
An integral feature of any government data system is that it is underpinned by ethical principles, to ensure responsible data use and prevent harmful outcomes. Respect for people is about recognising the people behind the data and the interests of individuals and groups in how data is used.
Protecting confidentiality of data is an important way of showing respect for people. Whenever you release data you must take extra care with data that is personally or commercially sensitive.
Among the principles in the International Statistical Institute’s Declaration on Professional Ethics are that, when statisticians produce statistics, they must guard privileged information, and protect the interests of individuals and organisations.
Government agencies and other producers of official statistics are also guided by the United Nations Fundamental Principles for Official Statistics:
“it is the utmost concern of official statistics, to secure the privacy of data providers (like households or enterprises) by assuring that no data is published that might be related to an identifiable person or business.”
Protecting personal identifying information and preserving security of any output is emphasised in the Principles for safe and effective use of data and analytics developed by the Government Chief Data Steward and the Privacy Commissioner.
Other ethical guidelines will be relevant for specific types of research. For example, the National Ethics Advisory Committee’s Ethical Guidelines for Observational Studies covers research using health data.
It is essential to use confidentiality methods to protect individually identifiable information in microdata. You may also need to use them to protect larger datasets and data outputs.
Whenever we release data — to the public, a researcher, or any other kind of data user — we must make sure its confidentiality is appropriately protected.
We protect confidentiality by ensuring that details about individual people, households, businesses, or organisations are not identifiable, and cannot be deduced. Details must not be identifiable in the raw data, published statistics, or the data output users create.
Often you can release individually identifiable details, where you need to, provided you have received written authorisation from the individual to do so.
Unit record data and summary data — called microdata — is especially likely to be identifiable, as it is records of individual people, households, businesses, or organisations.
Statistical data that will be published needs to be organised in a way that prevents any individual details from being identified.
To protect the confidentiality of microdata — and where necessary, larger datasets — you can use one or more of these statistical methods:
Review your confidentiality business rules, methods, and processes regularly – at least every three to five years. You need to ensure that new technology, or the public availability of additional data, has not increased the risk of disclosure. Introduce new measures for protecting confidentiality if you need to.
Even with protection in place, there is always a risk of disclosing identifiable data. A data breach or disclosure breach happens when data is released that identifies a person, household, business, or organisation.
You must acknowledge that there is always a risk of a data breach happening. The Office of the Privacy Commissioner’s Data Safety Toolkit has guidelines on remedying, managing, and mitigating data breaches.
Perturbation is a widely used data confidentiality method. It works by adding a random value to the data to mask the true value. This is called adding ‘random noise’.
Perturbation is a best-practice method. It is used by Stats NZ and by many international statistical agencies, including the US Census Bureau and the Australian Bureau of Statistics.
A count measures the number of individuals whose confidentiality is being protected.
A count magnitude (or value magnitude) measures a sum of counts (or sum of values) relating to the individual data you are protecting.
Stats NZ has developed a method which perturbs both count and magnitude tables: the Noised Counts and Magnitudes (NCM) method. NCM is part of Stats NZ’s development of an Automated Confidentiality Service (ACS). The ACS includes software, applications, and expertise to help users automatically apply confidentiality methods and produce consistent results.
In the NCM method, each individual data record is assigned a uniformly distributed random number. These random numbers are fixed across time, to ensure the same degree of perturbation is applied to the individual over time.
For count tables, the random numbers of the units grouped together in a cell are used to generate a new random number for that cell. This is the basis for fixed random rounding to base 3 (FRR3). It ensures the same group of individuals will always be rounded the same way in related tables.
In FRR3, you randomly round counts to base 3.
For example, a four will be rounded to either a three (2/3 likelihood) or a six (1/3 likelihood). This is to disguise small counts. But since all table data are rounded consistently, they are protected against both:
We can protect information in counts tables by random rounding to base 3 (RR3). The counts are randomly rounded to base three in a consistent manner. This is to disguise small counts, but all cells in the table are randomly rounded. The effect is to make the output more confidential, by generally preventing individuals' data from being released.
For small numbers, where there is the most risk that individuals could be identified, there are larger percentage changes compared with larger numbers. For example, a cell with a one changed to a three has been changed by 200 percent, but a cell with 1,001 changed to 1,002 has been changed by only 0.1 percent. When analysing data, small counts need to be treated with caution but for the larger values the percentage changes in these cells do not cause a problem.
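The rounding described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Stats NZ's NCM implementation: the source does not say how the cell-level random number is derived, so `cell_random` (a hypothetical helper) simply takes the fractional part of the sum of the records' fixed random numbers. The 2/3 and 1/3 probabilities follow the example in the text.

```python
import math


def cell_random(record_randoms: list[float]) -> float:
    """Combine the fixed per-record random numbers of a cell into one number in [0, 1).

    Assumption: the fractional part of the sum is used; the source does not
    specify the combining rule.
    """
    return math.fsum(record_randoms) % 1.0


def round_base3(count: int, u: float) -> int:
    """Round count to a multiple of 3, using u in [0, 1) as the random draw.

    Multiples of 3 are unchanged. Otherwise the nearer multiple is chosen when
    u < 2/3 and the farther multiple when u >= 2/3.
    """
    remainder = count % 3
    if remainder == 0:
        return count
    lower, upper = count - remainder, count - remainder + 3
    nearer, farther = (lower, upper) if remainder == 1 else (upper, lower)
    return nearer if u < 2 / 3 else farther


# Hypothetical fixed random numbers for the four records in one cell.
randoms = [0.25, 0.5, 0.125, 0.0625]
u = cell_random(randoms)                 # same records give the same u in every table
rounded = round_base3(len(randoms), u)   # the cell count of 4 becomes 3 or 6
```

Because `u` depends only on which records are in the cell, the same group of records is rounded the same way wherever it appears, which is the property that distinguishes fixed random rounding from independent random rounding.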
Use an n% ‘noise multiplier’ to generate magnitude tables.
The noise protects sensitive data where there is a disclosure risk but cancels itself out in larger collections of data.
Individual values are protected by at least +/- n% for the most vulnerable data.
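A minimal sketch of a noise multiplier for magnitude tables is shown below. The mapping from a record's random number to its multiplier, the `spread` parameter, and the 10 percent default for n are all assumptions made for illustration; the source does not describe how NCM generates its multipliers.

```python
import random


def noise_multiplier(u: float, n: float = 0.10, spread: float = 0.05) -> float:
    """Map a record's fixed random number u in [0, 1) to a multiplier.

    The multiplier differs from 1 by at least n and by less than n + spread.
    Assumptions: the direction comes from which half of [0, 1) u falls in,
    and spread is a hypothetical parameter.
    """
    direction = 1.0 if u >= 0.5 else -1.0
    size = n + spread * abs(2.0 * u - 1.0)
    return 1.0 + direction * size


def noised_total(values: list[float], randoms: list[float], n: float = 0.10) -> float:
    """Sum the values in a cell after applying each record's multiplier."""
    return sum(v * noise_multiplier(u, n) for v, u in zip(values, randoms))


rng = random.Random(42)                     # seeded so the sketch is repeatable
values = [100.0] * 10_000                   # hypothetical unit values
randoms = [rng.random() for _ in values]    # stand-in for fixed per-record numbers

single = noised_total(values[:1], randoms[:1])  # a one-record cell moves by at least n
large = noised_total(values, randoms)           # upward and downward noise largely cancel
```

A cell containing a single record is shifted by at least n percent, while in a large cell the noise tends to cancel out, so the published total stays close to the true total.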
Aggregation involves grouping categories together. You avoid disclosure by combining columns or rows into one new group. You combine or simplify data outputs. This reduces the amount of data available about individuals.
In the long run, aggregation is effective for striking a balance between releasing as much data as possible and limiting the work involved in producing tables.
Aggregation is useful when there are many cells with small numbers. By collapsing categories or combining data cells, you remove much of the sensitivity in the table.
You need subject matter knowledge to use this method. You need to know which values in the data are important for your data users, and how values have been aggregated in the past, so you can apply aggregation consistently.
Aggregation lowers the amount of detail in the final output data. You need to ensure that the resulting dataset is still useful for your users.
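As an illustration of collapsing categories, the sketch below sums small cells into broader groups. The age groups, counts, and collapsing map are hypothetical; in practice the map would come from your classification's standard collapsed levels.

```python
from collections import defaultdict

# Hypothetical counts by detailed age group; several cells are small.
detailed = {"15-19": 2, "20-24": 1, "25-29": 14, "30-34": 21, "35-39": 3}

# Assumed collapsing strategy: map each detailed category to a broader group.
collapse_map = {
    "15-19": "15-29", "20-24": "15-29", "25-29": "15-29",
    "30-34": "30-39", "35-39": "30-39",
}


def aggregate(counts: dict[str, int], mapping: dict[str, str]) -> dict[str, int]:
    """Sum detailed cells into the broader categories given by mapping."""
    totals: dict[str, int] = defaultdict(int)
    for category, count in counts.items():
        totals[mapping[category]] += count
    return dict(totals)


collapsed = aggregate(detailed, collapse_map)
```

The collapsed table has no small cells, and its grand total is unchanged. The cost is that users can no longer separate, for example, the 15-19 and 20-24 groups.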
To maximise flexibility, code data at the lowest level of the classification possible.
Make sure that your data classifications and standards are relevant to your customer’s needs.
Classifications and standards should:
Classifications and standards must be unambiguous, exhaustive, and mutually exclusive:
Classifications and standards must be systematic and operationally feasible. To achieve this:
Use a common collapsing strategy for aggregations. Give classifications names that reflect both the most detailed and the collapsed levels.
When you suppress data, you do not report selected data. Suppression is removing data from an output that reveals individualised information.
If a data value reveals too much data about a person, household, or business, you can remove the data value from the output by suppressing. You replace its number value with another value, such as an empty space, a zero, or a character like ‘S’ or ‘C’. This is primary suppression.
But if you decide a data value is at risk, suppressing only that value is not enough. If you give subtotals or marginal totals, it is still possible to determine the suppressed cell’s actual value. You need to suppress other data values too, to protect the primary data value. Suppressing these other data values, in the same way, is secondary suppression.
You need to suppress other cells, so the value of the cell you first suppressed can’t be determined.
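The sketch below shows why primary suppression alone is not enough when a marginal total is published. The table values and the threshold of 3 are hypothetical.

```python
# Hypothetical row of a published table: three cells and their marginal total.
row = {"North": 2, "Central": 41, "South": 57}
row_total = sum(row.values())  # published alongside the cells

THRESHOLD = 3  # assumed rule: suppress counts below 3


def primary_suppress(cells: dict[str, int], threshold: int) -> dict[str, object]:
    """Replace any count below threshold with the marker 'S'."""
    return {k: (v if v >= threshold else "S") for k, v in cells.items()}


published = primary_suppress(row, THRESHOLD)

# With only one 'S' in the row, anyone can recover it from the total.
visible = sum(v for v in published.values() if v != "S")
recovered = row_total - visible
```

Here `recovered` equals the suppressed count, so at least one more cell in the row needs to be suppressed before the primary cell is protected.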
To suppress the fewest cells possible, complete a square of suppressions:
2^N total suppressions for an N-dimensional table (for example, 2^2 = 4 total suppressions for a 2-dimensional table).
Secondary suppression is often not an easy task. To do it, you need:
Use these criteria to decide how to apply secondary suppression.
It is important to keep track of your publication history of primary and secondary suppressions, and to take care not to disclose data where you change which cells are suppressed, over time. Changing previous cell suppression trends might cause either:
You might want to:
You might need to publish certain cells for statistical information reasons, so you cannot use them for secondary cell suppression; this might give you problems finding enough, or appropriate, cells to suppress.
To test if a suppression pattern is effective enough, make sure that in each row or column you suppress, there are at least two suppressions. For a 2-way table, each suppression should be the corner of a square or rectangle of suppressions.
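The row and column test described above can be sketched as a small check on a 2-way table. The function name and the example patterns are illustrative. Passing the check is a minimum condition only; it does not show that suppressed values cannot be narrowed down in other ways.

```python
def pattern_is_protected(suppressed: list[list[bool]]) -> bool:
    """Return True if every row and column has either no suppressions or at least two."""
    row_counts = [sum(row) for row in suppressed]
    col_counts = [sum(col) for col in zip(*suppressed)]
    return all(c == 0 or c >= 2 for c in row_counts + col_counts)


# True marks a suppressed cell in a hypothetical 3 x 3 table.
primary_only = [
    [True, False, False],
    [False, False, False],
    [False, False, False],
]
square = [
    [True, True, False],
    [True, True, False],
    [False, False, False],
]
```

The `primary_only` pattern fails the check because its single suppression can be derived from the row or column total, while the `square` pattern of four suppressions passes.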
Primary and secondary suppression can be a time-consuming manual process. Some automated tools to help include Tau-ARGUS, G-Confid, and sdcTable.
Unit record datasets that contain information about specific people, households, and organisations (microdata) are most likely to reveal identifiable information. Protect confidentiality by imposing strict limitations on access to it.
Only grant access to microdata to researchers who state the statistical purposes for wanting access.
Where you approve access, consider drawing up a legally binding contract to control access to the data.
Stats NZ assesses research proposals to access microdata using the following principles:
Sometimes, negotiations for researcher access involve multiple data custodians. Each custodian should consider and grant access individually.
When you consider granting access to data, also consider the Privacy Act. The Act governs the use of data beyond the purpose for which it was originally collected.
In some situations, you may need to consult the Privacy Commissioner. For example, you may have a case where a legal provision parallels or constrains the relevant legislation. Or the privacy implications of the research may not be clear.
At Stats NZ, microdata researchers operate within the 'five safes' framework. We only grant access to microdata if all the following conditions are met:
The microdata output guide is Stats NZ’s best-practice guide for ensuring confidentiality in outputs from microdata. It covers how to use the statistical methods in greater detail, with examples.
You can publish open microdata once its confidentiality is protected. You use statistical methods to prepare synthetic unit record files (SURFs) and confidential unit record files (CURFs) that are suitable for general publication.
You use the methods of perturbation, aggregation, and suppression to process microdata so individual people, households, businesses, and organisations cannot be identified.
CURFs are published overseas, for example, the Integrated Public Use Microdata Series (IPUMS), which provides US census data for social, economic, and health research. Open government initiatives, rather than national statistics organisations, have pioneered the release of CURFs.
You build CURFs by perturbing, aggregating, and suppressing microdata, until the data no longer discloses identifiable information about individuals, but is also still an accurate enough summary estimate of the data to meet the customers’ needs.
When you create CURFs, you may confidentialise data by replacing the real data with data you have processed or modelled. Lightly confidentialised CURFs are called partly synthetic data. Heavily confidentialised CURFs are called fully synthetic data, or SURFs.
Creating CURFs and SURFs is challenging and requires technical expertise. Research continues into how to automate the work. Current techniques can quantify the confidentiality importance of each variable and mitigate the risk for each variable. You can use k-anonymity testing and the Special Uniques Detection Algorithm (SUDA) within automated tools like sdcMicro.
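To illustrate the idea behind k-anonymity testing, the sketch below counts how many records share each combination of quasi-identifiers. It is a plain Python illustration and does not use or reproduce the sdcMicro interface. The records and the choice of quasi-identifiers are hypothetical.

```python
from collections import Counter

# Hypothetical unit records; the quasi-identifiers are an assumed choice.
records = [
    {"gender": "Female", "age_band": "30-40", "region": "Wellington"},
    {"gender": "Female", "age_band": "30-40", "region": "Wellington"},
    {"gender": "Male", "age_band": "30-40", "region": "Wellington"},
    {"gender": "Male", "age_band": "30-40", "region": "Wellington"},
    {"gender": "Male", "age_band": "40-50", "region": "Auckland"},
]
QUASI_IDENTIFIERS = ("gender", "age_band", "region")


def k_anonymity(rows, keys):
    """Return k (the smallest group size) and the combinations that occur only once."""
    groups = Counter(tuple(row[key] for key in keys) for row in rows)
    uniques = [combo for combo, size in groups.items() if size == 1]
    return min(groups.values()), uniques


k, uniques = k_anonymity(records, QUASI_IDENTIFIERS)
```

A file is k-anonymous when every combination of quasi-identifiers is shared by at least k records. In this example one combination is unique, so that record would need further aggregation, perturbation, or suppression before release.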
Often, the more heavily you confidentialise a record, the less useful it is to your customers or end users. You need to strike a balance between confidentiality and usefulness.
If you cannot ensure data is confidential, you may need to withhold it.