6 Questions data scientists ask when working with new datasets
June 10, 2022
When investigating new datasets, ADH data scientists ask themselves important questions about the datasets and the sources of the data they are working with. The main question to answer is: can this data answer our research question? This question leads to several more, which we share here.

There are various data sources with similar data points or datasets of interest, and deciding which is the most suitable for the use case or research question involves some analysis. For example, the ADH data scientists asked the following questions and did some preliminary analysis before determining that Our World in Data (OWID) was the best dataset to use in the ADH Covid-19 Observer (previously called the Resurgence Map): it is a rich dataset that is updated regularly by a reputable and widely cited source.

1. Who published the dataset - is it trustworthy?

Trustworthiness and reliability are dependent on the context and use case of data, but the source of the data plays a significant role in the quality of a dataset. Here are some factors that ADH data scientists look at when deciding on the trustworthiness of a data source:

  • Reputation & Authority: How long has the organisation or data source been in the industry or field of interest? A well-known research institution with many years of publishing datasets used by national and international decision-making bodies like the UN, WHO, and national governments is more trustworthy than a new independent organisation with no established reputation. That is not to say that a new organisation cannot become established and develop its reputation and authority.
  • Risk: What are the consequences for the organisation of publishing incorrect or inconsistent data? Suppose the organisation is not held accountable for publishing incorrect or inconsistent data, or there is no incentive for it to verify or validate the data it publishes. In that case, certain caveats or limitations apply. For example, a blogger or influencer who has only their reputation and following to consider may take the risk and publish controversial findings using unverified data for attention. They may not suffer any damage for publishing incorrect or inconsistent data.
  • Sphere of Influence: How many people and organisations make up the data source’s sphere of influence? If this is a widely cited source of data and used by influential organisations like national health ministries, then you can be more confident in the reputation and authority of the data source and, ultimately, the quality of the data.

2. Can the data be used - is it accessible and well structured?

Whether the dataset can be accessed easily and has been assembled using sound data structuring principles will determine how you go about analysing, and ultimately using, the data.

Data can be cleaned, restructured, and analysed using tools like Excel and Google Sheets (see our training on using Sheets here: The Fundamentals of Data Journalism) or programming environments like Python and MATLAB; a host of other digital tools are also available for data analysis.
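For readers working in Python, a minimal sketch of this kind of cleaning with the pandas library might look like the following. The figures here are invented for illustration:

```python
import pandas as pd

# Hypothetical sample of daily case data with common quality issues:
# date strings that need parsing and a missing value.
raw = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "new_cases": [120, None, 95],
    "location": ["Kenya", "Kenya", "Kenya"],
})

# Parse dates into a proper datetime type.
raw["date"] = pd.to_datetime(raw["date"])

# Fill the missing daily count with 0, making the gap explicit and auditable.
clean = raw.fillna({"new_cases": 0})

print(clean["new_cases"].tolist())  # [120.0, 0.0, 95.0]
```

The same steps (restructuring, filling gaps, standardising types) apply whichever tool you choose.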

3. How was the data collected and captured?

When trying to establish the accuracy of a dataset of interest, it is crucial to be aware of the methodologies used to collect, share, and store it.

For example, COVID-19 case numbers refer to all confirmed cases of COVID-19, where a person was tested for COVID-19 and the result was positive. However, there are potentially many unconfirmed cases that are not accounted for. Relevant national centres for disease control collect this information from hospitals and testing stations and publish the aggregated data accordingly. This kind of information is the metadata (view ADH datasets here) that accompanies a dataset, and it is critical to review when exploring a dataset's value.

Knowing the correct definition for the data and exactly what the data point is a measure of can assist in determining the accuracy of the dataset. Questions data scientists ask about methodologies include:

  • How was the data collected?
  • What questions were asked?
  • What tests were done?
  • How was the data captured and stored?
  • What are the definitions and measures applied to these data fields?
  • How often is the data updated and shared?
  • Who updates the dataset?
  • Who verified the dataset, and how?

This awareness of data collection and structuring methodologies allows data scientists to understand how and why the data is structured as it is. It also explains whether there are any gaps in the data or corrections made to it, and why. For example, knowing that some national health ministries publish updated/corrected total COVID-19 deaths data can account for the updates made to OWID datasets. When there is a reasonable explanation for inconsistencies or data gaps, you can be more confident that the data is accurate.
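Corrections like these can often be spotted programmatically: a cumulative total should never decrease, so a negative day-on-day change flags a likely retrospective revision. A small illustrative sketch in Python with pandas, using invented figures:

```python
import pandas as pd

# Hypothetical cumulative death counts; the drop on the fourth day
# suggests a retrospective correction by the publisher.
totals = pd.Series([100, 105, 112, 108, 115])

# A negative day-on-day difference in a cumulative series flags a correction.
corrections = totals.diff() < 0
print(corrections.tolist())  # [False, False, False, True, False]
```

Rows flagged this way are not necessarily wrong; they are simply the places where the collection methodology is worth checking.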

4. How does this data compare to other data?

Comparing data from two or more sources for correlation and consistency also helps establish whether the data is a true reflection of events. This is a form of data triangulation. Various qualitative methods of analysis, such as desktop research and interviews with experts, can also be used to establish the reliability and trustworthiness of data. For example, in creating the Covid-19 Testing and Positivity Data Visualisation Tool, different data sources that publish and share Covid-19 related datasets were compared for consistency and correlation.
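A simple version of this triangulation can be scripted in Python with pandas: align the two sources on date, then measure how closely they agree. The two sources and their figures below are hypothetical:

```python
import pandas as pd

# Two hypothetical sources reporting daily cases for the same country.
source_a = pd.DataFrame({"date": pd.date_range("2021-06-01", periods=5),
                         "cases": [200, 180, 210, 190, 205]})
source_b = pd.DataFrame({"date": pd.date_range("2021-06-01", periods=5),
                         "cases": [198, 182, 207, 191, 203]})

# Align the two series on date, then measure agreement.
merged = source_a.merge(source_b, on="date", suffixes=("_a", "_b"))
correlation = merged["cases_a"].corr(merged["cases_b"])
print(round(correlation, 2))
```

A high correlation is reassuring, but persistent systematic differences (one source always higher than the other, say) are exactly the kind of finding that warrants follow-up with the qualitative methods mentioned above.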

5. Does the data show what we expect to find?

Data scientists and subject matter experts often have some expectations of what the data might show. Preliminary analysis using simple data visualisations can reveal trends or patterns that confirm what you expected to find, or that surprise you.

For example, in creating the New Tests and Positivity Rate Data Visualisation Tools, data was visualised and analysed to confirm accuracy. Visualising the fields within a dataset of interest can reveal unexpected values, such as impossible rates (e.g. a positivity rate of 110%) or null entries; these may require further investigation. Unexpected values are often explained by the data collection, storage, and sharing processes; with Covid-19 related data, they are often a result of human error or habit, such as lags between data collection and data sharing.
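Checks for these unexpected values are straightforward to script alongside the visualisations. The sketch below, in Python with pandas, flags out-of-range positivity rates and null entries in an invented sample:

```python
import pandas as pd

# Hypothetical positivity-rate figures, including an impossible value
# (above 100%) and a null entry, both worth investigating.
positivity = pd.Series([0.08, 0.12, 1.10, None, 0.09])

# Flag rates outside the valid 0-1 range, and missing entries.
out_of_range = positivity[(positivity < 0) | (positivity > 1)]
missing = positivity[positivity.isna()]

print(len(out_of_range), len(missing))  # 1 1
```

Each flagged value then becomes a question for the data source's methodology rather than something to silently delete.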

6. What questions can the data answer?

When analysing data, it’s important to know if the fields within the dataset can be compared or highlighted to support a point or answer the research question. 

In our case, we wanted to source and supply data that journalists would find useful in their coverage of the pandemic in Africa. Our research question for the ADH COVID-19 data resources was: How severe is the COVID-19 pandemic in Africa? OWID provided near real-time COVID-19 epidemiological data for most African countries suited to answering this question.

This data also satisfied our use case because we needed accurate and reliable data supplied with near real-time regularity. We selected only the data fields that suited the ADH use case. There is a wide range of indicators of the severity of the Covid-19 pandemic on the African continent, and an understanding of the needs of journalists in the African context played a role in which data fields from the OWID dataset were used in creating our products.

More about why we used Our World In Data in our COVID-19 data resources 

We have used OWID as an example of a trustworthy data source that provides reliable and accurate data. What characteristics of the OWID dataset did our data scientists analyse and what does this analysis entail?

OWID has built 207 country profiles which allow users to explore statistics on the coronavirus pandemic for every country in the world; since the ADH focus is on African countries, only some data fields were applicable to our use case. An extensive list of indicators is used to describe each country's COVID-19 cases, deaths, vaccinations, testing, and government responses.

The following indicators were relevant:

  • new_cases
  • total_tests_per_thousand
  • new_cases_smoothed
  • new_cases_smoothed_per_million
  • total_cases
  • total_cases_per_million
  • new_deaths
  • tests_units
  • new_deaths_smoothed_per_million
  • total_deaths
  • total_deaths_per_million
  • stringency_index
  • people_vaccinated_per_hundred
  • reproduction_rate
  • people_fully_vaccinated_per_hundred
  • new_tests
  • new_tests_smoothed
  • new_vaccinations_smoothed
  • positive_rate
  • new_tests_smoothed_per_thousand
  • tests_per_case
  • new_tests_per_thousand
  • total_tests
  • new_vaccinations
  • new_deaths_smoothed
  • new_deaths_per_million
  • new_vaccinations_smoothed_per_million
  • people_fully_vaccinated
  • people_vaccinated
  • new_cases_per_million
  • total_vaccinations

You can download or explore this data here.
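If you are working with a local copy of the OWID CSV in Python, pandas can restrict the load to just the relevant fields via its usecols option. The filename and the small field selection below are illustrative, and the in-memory sample stands in for the real file:

```python
import pandas as pd

# A hypothetical subset of the OWID indicators relevant to a use case.
fields = ["location", "date", "new_cases", "positive_rate", "total_deaths_per_million"]
# With a local copy of the dataset (illustrative filename):
# df = pd.read_csv("owid-covid-data.csv", usecols=fields)

# The same selection applied to a tiny in-memory sample:
sample = pd.DataFrame({
    "location": ["Ghana"], "date": ["2021-07-01"], "new_cases": [310],
    "positive_rate": [0.11], "total_deaths_per_million": [26.4],
    "stringency_index": [55.0],  # present in OWID but not needed in this example
})
subset = sample[fields]
print(list(subset.columns))
```

Loading only the fields you need keeps the analysis focused on the research question and the file sizes manageable.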


If you found value in our article, sign up for our newsletter.