Lies, damn lies and CNCF Surveys

The CNCF Surveys contain a number of fundamental errors that call into question the validity of the conclusions they draw.

Spending hours a day poring over COVID statistics has had the unintended consequence of improving numerical literacy for many of us. Unfortunately, the team at the Cloud Native Computing Foundation (CNCF) responsible for its annual survey is not among that number.

The root of the many issues with the CNCF Surveys is poorly structured questions and responses that leave too much scope for interpretation of the data. To compound this, the Survey Reports, which the CNCF publishes as its interpretation of the data, offer little transparency about how they arrive at their conclusions.

Thankfully the folks at the CNCF do work in the open and have made the response data available here, which I've used to dig into the problems.

Poorly structured questions and responses

The majority of the questions in the survey offer predefined responses that either do not cover all possible answers to a question, or offer multiple ways of saying the same thing, which are then treated as distinct answers.

An example of this is the question "Please indicate if your company/organization currently uses, or has future plans to use, containers for any of the below options" which only offers the options of "Currently Using" or "Future Plans", but no option to say "Not using and no plans to use".

Within the strict framing of the question this is correct, as the question only asks about current and future use. Yet the survey reports present this data as overall growth in container usage, something the question cannot measure because it never captures respondents who are not using containers and have no plans to use them.

The other way responses get distorted is through questions whose predefined answers are not distinct, allowing respondents to express the same thing in multiple ways. The best example of this in the report is the question "Which of the following data center types does your company/organization use? Please select all that apply.".

The responses to this question in the 2020 survey are "Private Cloud / On Premise", "Public Cloud", "Multi-cloud", "Hybrid Cloud (i.e. both)", "Other", "N/A or don't know".

The accompanying survey report defines Hybrid and Multi-cloud as follows "hybrid cloud refers to the use of a combination of on-premises and public cloud. Multi-cloud means using workloads across different clouds based on the type of cloud that fits the workload best."

When the data is reported, though, these categories are treated as distinct. For example, the figure for Public Cloud only includes respondents who answered "Public Cloud", not those who answered "Hybrid Cloud" or "Multi-cloud". If a respondent is using both a private cloud and a public cloud but only selected hybrid cloud, the report does not reflect that the respondent is using public cloud.

Respondents also did not answer this question uniformly: of those using a hybrid cloud model, some would have selected just hybrid, others private, public and hybrid, and others just private and public. The result is that the survey understates the number of respondents using each of the deployment models.
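
To make the undercounting concrete, here is a minimal sketch using three invented respondents (not real survey data) who all run workloads on both a private and a public cloud, but answer the multi-select question in the three different ways the options allow.

    # Three hypothetical respondents, all running workloads on both a private
    # and a public cloud, answering the multi-select question differently.
    respondents = [
        {"Hybrid Cloud (i.e. both)"},                                    # picked only the combined option
        {"Private Cloud / On Premise", "Public Cloud"},                  # picked the two parts
        {"Private Cloud / On Premise", "Public Cloud", "Hybrid Cloud (i.e. both)"},  # picked everything
    ]

    # Naive counting, one tally per literal option, as the report appears to do.
    options = ["Private Cloud / On Premise", "Public Cloud", "Hybrid Cloud (i.e. both)"]
    naive = {opt: sum(opt in r for r in respondents) for opt in options}
    print(naive)
    # Every option scores 2 out of 3, even though all three respondents use
    # private cloud, public cloud, and a hybrid of the two.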

Transparency

The other big problem with the survey reports is transparency. Up front, each report declares how many people responded to the survey, but when it examines individual questions it doesn't declare how many answers each question received.

The issue with this is that the 2020 and 2018 survey datasets both contain a significant number of incomplete responses.

  • The 2020 survey has 259 out of 1,324 responses where the respondent didn't answer any questions beyond the demographic and basic cloud usage questions.
  • The 2019 survey data set has no incomplete responses, so it is possible to base a report on complete responses only.
  • The 2018 survey has 469 out of 2,400 responses where the respondent didn't get past the basic questions on CNCF project usage.

There are many failures here, from the UX of the survey tool and the structure and choice of questions, down to the decision to include incomplete responses. Potentially the biggest failure is that the inclusion of incomplete responses isn't declared in any of the survey reports, and the impression given is that the percentages presented are of the complete data set, not of the answers received to a question.

The incomplete response problem immediately brings into question any percentage based metric presented in the 2020 and 2018 survey reports, as there is no transparency as to what it is a percentage of.

Diving into the data we get a better view of how these issues impact the reports.

Containers in production?

The question "Please indicate if your company/organization currently uses, or has future plans to use, containers for any of the below options", which is asked across 4 categories, does not factor in respondents not using or with no plans to use containers.

The 2020 report states that "92% of respondents say they use containers in production", and "95% of respondents use containers in the proof of concept (PoC), test, and development environments".

The CNCF Survey Report graphs this data with the image below.

Page 6, CNCF 2020 Survey Report

When the survey report states "92% of respondents", what does it mean? It can't mean 92% of all survey respondents, because not even that proportion of respondents answered the question in 2020.

The container usage numbers shown in the survey reports take a simplistic approach: divide the number of "Currently Using" responses by the total number of responses for the question, per category. I don't have a problem with this approach, as the maths are simple to reproduce. The problem is that the data gathered doesn't reflect non-usage or no plans to use, while being presented as a view of overall container usage.

The table below provides a breakdown of the 2020 survey data for this question across the four categories, and how the percentages were calculated; the sketch after it reproduces the arithmetic.

                        Proof of Concept (PoC)   Development   Test     Production
Total Responses         902                      927           928      969
Currently Using         856                      885           878      891
Currently Using (%)     94.90%                   95.47%        94.61%   91.95%
Future Plans            46                       42            50       78
Future Plans (%)        5.10%                    4.53%         5.39%    8.05%
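
The arithmetic is easy to reproduce. The sketch below recomputes the 2020 percentages from the counts in the table above (the figures are copied from the table, not recalculated from the raw CSV).

    # "Currently Using" and "Future Plans" counts for 2020, taken from the table above.
    counts_2020 = {
        "Proof of Concept (PoC)": {"Currently Using": 856, "Future Plans": 46},
        "Development":            {"Currently Using": 885, "Future Plans": 42},
        "Test":                   {"Currently Using": 878, "Future Plans": 50},
        "Production":             {"Currently Using": 891, "Future Plans": 78},
    }

    for category, counts in counts_2020.items():
        total = sum(counts.values())  # total responses to this category
        using_pct = 100 * counts["Currently Using"] / total
        print(f"{category}: {counts['Currently Using']}/{total} = {using_pct:.2f}% currently using")
    # Production comes out at 891/969 = 91.95%, the "92% of respondents" headline figure.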

To answer the question, what does the CNCF mean in the 2020 survey report when they say "92% of respondents"? They mean that, of the respondents who answered either "Currently Using" or "Future Plans" for a given category, 92% answered "Currently Using" (in this case, for production).

What is a more accurate picture of container usage across these categories? Looking at the raw data and working through various scenarios we might be able to find something more realistic.

If we assume that anyone who reached this question but didn't answer it did so because the available options didn't apply, i.e. they are not using containers and have no plans to use them, we get a different picture.

To improve accuracy we'll discard the non-responses from folks who didn't complete the survey, as they never got to this question. We'll use the Completed Responses numbers from the table below; the sketch after it shows how the abandoned responses can be identified in the raw data.

Year   Total Respondents   Abandoned Responses   Completed Responses
2018   2,400               469                   1,931
2019   1,337               0                     1,337
2020   1,324               259                   1,065
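
For anyone wanting to reproduce the abandoned-response counts, something like the sketch below works against the published CSVs. The file name and the cut-off for the demographic and basic questions are placeholders; they need adjusting to the actual column layout of each year's export.

    import pandas as pd

    # Load the published response data (file name is illustrative, not the real export name).
    df = pd.read_csv("cncf_survey_2020.csv")

    # Assumption: the demographic and basic cloud-usage questions occupy the first
    # few columns; everything after that is the body of the survey.
    BASIC_COLUMNS = 12  # hypothetical cut-off, adjust to the real column layout
    body = df.iloc[:, BASIC_COLUMNS:]

    # A response is "abandoned" if every question beyond the basics is blank.
    abandoned = body.isna().all(axis=1)
    print(f"{abandoned.sum()} of {len(df)} responses abandoned after the basic questions")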

The table below shows the total number of "Currently Using" responses for the 2019 and 2020 surveys, and "Today" responses for the 2018 survey.

Year   Proof of Concept (PoC)   Development   Test    Production   Current Use (any category)
2018   1,368                    1,435         1,395   1,271        1,765
2019   1,049                    1,137         1,102   1,091        1,267
2020   858                      885           878     891          960

The next table breaks out just the "Future Plans" responses.

Year   Proof of Concept (PoC)   Development   Test   Production   Future Plans (any category)
2018   170                      233           244    471          592
2019   68                       91            110    179          241
2020   46                       42            50     78           114

Using this data to create a comparative graph to the CNCF 2020 Survey Report results in this.

In the graph above I've added an extra category called "Currently Using", which is the percentage of respondents that completed the survey who answered "Currently Using" to any of the four categories.
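
The percentages behind that extra category fall straight out of the two tables above: divide the "Current Use (any category)" count by the completed-response count for each year.

    # Completed responses and "Current Use (any category)" counts, taken from the tables above.
    completed = {2018: 1931, 2019: 1337, 2020: 1065}
    using_any = {2018: 1765, 2019: 1267, 2020: 960}

    for year in (2018, 2019, 2020):
        pct = 100 * using_any[year] / completed[year]
        print(f"{year}: {using_any[year]}/{completed[year]} = {pct:.1f}% currently using containers")
    # 2018: 91.4%, 2019: 94.8%, 2020: 90.1% - the 2019-to-2020 dip discussed below.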

This aggregate category actually shows a decline in container usage from 2019 to 2020. Do I think this is the reality? No. I think this is the result of the sample size declining in 2020, demographics of respondents changing (more execs answered in 2020), and potentially the survey tooling changing. These are assumptions though.

Overall, the picture painted by the graph above shows a flattening of adoption in line with a product adoption s-curve, not the growth shown in the CNCF report. Even with the decline in responses for "Currently Using", overall usage of containers remains very high at around 90%, and a slowing of growth should be expected.

Is this new graph I created accurate? I would say it's a more accurate representation of the data, but not wholly accurate. It uses a sensible assumption to infer conclusions around non-usage of containers, but the assumption does have scope for error. A better designed survey would result in better data, and more accurate conclusions.

Cloud Usage

Let's look at the cloud usage question "Which of the following data center types does your company/organization use? Please select all that apply."

The way the answers are represented in the report misinterprets them. The graph below from the 2020 CNCF Survey Report shows cloud usage across the 2019 and 2020 reports.

Chart showing clustered bars for cloud usage, with Private Cloud for 2019 at 45% and 52% for 2020, Public Cloud at 62% and 64% for 2019 and 2020, Hybrid Cloud at 38% and 36% for 2019 and 2020, and Multi-cloud at 26% for 2020.
Cloud Usage, page 4, CNCF Survey Report 2020 

These numbers are just the raw responses graphed, without accounting for the multiple ways the same usage can be expressed.

For example, respondents using hybrid cloud who selected only the "Hybrid Cloud (i.e. both)" option won't be reflected in the Private or Public cloud numbers. Where "Multi-cloud" is the only response selected, it won't be included in the "Public Cloud" numbers. The graph above is a graph of literal responses, not a graph of usage.

If I go further back to 2018, the question had a different range of possible answers, namely "Private Cloud", "On Premise" and "Public Cloud", with no options for hybrid cloud or multi-cloud. This is actually an improvement on the 2019 and 2020 options, as the only way to indicate use of both public and private cloud was to explicitly select both, rather than having the ambiguity of a separate "Hybrid" option.

What is also not disclosed in the above graph is the varying sample size. As this question was asked early in the survey, a higher percentage of folks answered it. The 2019 survey had 1,337 respondents, and this specific question had 1,332 responses. In the 2020 survey, which had 1,324 respondents, this question only had 1,198 responses, a full 10% fewer than in 2019.

If I work with the raw response data for 2018, 2019 and 2020, while dropping the non-responses for this question, a more realistic picture of cloud usage emerges.

This paints a different picture of cloud usage to the one presented by the CNCF survey, while still using the same data.

Public, Private and Hybrid cloud usage is significantly higher, and all three are growing. I'm also able to break out Private-only and Public-only usage, both of which are showing a decline.

I'm able to get to these numbers by applying some logic to the survey responses, as shown in the sketch after the list below.

  • Private = If "Private Cloud" OR "Hybrid Cloud (i.e. both)"
  • Public = If "Public Cloud" OR "Hybrid Cloud (i.e. both)" OR "Multi-cloud"
  • Hybrid = If ("Private Cloud" AND ("Public Cloud" OR "Multi-cloud")) OR "Hybrid Cloud (i.e. both)"
  • Private Only = If "Private Cloud" AND no other answers
  • Public Only = "Public" computed answer above SUBTRACT "Hybrid" computed answer above
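
Expressed as code, that logic looks something like the sketch below. It classifies one respondent at a time, assuming each respondent's answers are available as a set of the 2020 option strings; the published CSVs encode the options as separate columns, so some reshaping would be needed first, and the option wording differs slightly between survey years.

    def classify(answers: set) -> dict:
        """Classify one respondent's multi-select answers into usage categories."""
        private = "Private Cloud / On Premise" in answers
        public = "Public Cloud" in answers
        hybrid_option = "Hybrid Cloud (i.e. both)" in answers
        multi = "Multi-cloud" in answers

        uses_private = private or hybrid_option
        uses_public = public or hybrid_option or multi
        uses_hybrid = (private and (public or multi)) or hybrid_option

        return {
            "Private": uses_private,
            "Public": uses_public,
            "Hybrid": uses_hybrid,
            # "Private Only" means Private Cloud was the only option selected.
            "Private only": answers == {"Private Cloud / On Premise"},
            # "Public Only" is the computed Public figure minus the computed Hybrid figure.
            "Public only": uses_public and not uses_hybrid,
        }

    # A respondent who only ticked the combined option still counts
    # towards Private, Public and Hybrid usage.
    print(classify({"Hybrid Cloud (i.e. both)"}))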

If you compare 2019 versus 2020 usage in the five categories I've calculated above, there is a very small difference across all of them, with only Hybrid cloud changing by more than 2%. If we then also factor in the sample sizes, with the 2020 sample being 10% smaller than 2019's, the differences between 2019 and 2020 look statistically insignificant.
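
That claim is easy to sanity-check. Taking the report's own public cloud figures as an example (62% of 1,332 responses in 2019 versus 64% of 1,198 in 2020), a standard two-proportion z-test gives a p-value of roughly 0.3, nowhere near conventional significance thresholds. This is an illustration of the check, not a rigorous treatment of the survey's sampling.

    from math import erf, sqrt

    def two_proportion_z_test(p1: float, n1: int, p2: float, n2: int):
        """Pooled two-proportion z-test; returns (z statistic, two-sided p-value)."""
        pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
        se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        z = (p2 - p1) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
        return z, p_value

    # Public cloud usage as reported: 62% of 1,332 responses (2019) vs 64% of 1,198 (2020).
    z, p = two_proportion_z_test(0.62, 1332, 0.64, 1198)
    print(f"z = {z:.2f}, p = {p:.2f}")  # roughly z = 1.04, p = 0.30: no significant change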

Again, with public cloud adoption we're seeing a flattening of the adoption curve in line with a product adoption s-curve as maturity grows.

Other Issues

There are other issues with the survey as a whole.

No questions are asked in the 2019 and 2020 surveys about ways to run containers outside of CNCF-certified Kubernetes platforms and distributions. If an organisation is running a distribution of Kubernetes on a cloud provider, the survey will only gather data on the distribution used, not the cloud provider being used. If a cloud provider has multiple ways to run containers, the questions are only about the provider's CNCF-certified Kubernetes offering, not its other container offerings.

This might be reasonable considering it is a CNCF survey, except that the survey grew out of a general container survey and its findings are still presented as if it were one, even though the 2019 and 2020 surveys contain only two questions about containers beyond Kubernetes.

There is also the strange questioning around serverless platforms. The survey lists a number of self-hosted and hosted serverless platforms, including platforms that are now defunct, but not platforms like Google Cloud Run. Why would the survey not ask questions about AWS Fargate or Azure ACI (container platforms), but ask about AWS Lambda and Azure Functions? Considering the CNCF itself has no serverless platforms under incubation, and excludes non-CNCF container platforms from its survey, even having a serverless section seems strange.

In Conclusion

What I've covered in this post is a non-exhaustive list of the issues with the survey and reports. The validity of the conclusions the survey report draws must be questioned. The survey structure and questions have a number of issues that need to be addressed, and the sample size of the report is dropping year on year.

My recommendations to the CNCF to improve future reports are:

  • Involve a professional research organisation in the production of this report
  • Actively work to expand the sample size
  • Work to improve questions where the responses don't include all possible answers
  • Remove any potential responses that are not distinct (i.e. don't provide more than one way to answer the same thing)
  • Expand the questions on container platforms beyond Kubernetes, and on the cloud platforms folks run Kubernetes on
  • Either fix the questions on serverless platforms (i.e. removing defunct platforms, adding new ones) or remove the serverless section completely

The survey report could be a hugely valuable resource for understanding broader trends, but the current quality of the report undermines any potential value it might provide, and that needs to be addressed.

Data

CNCF Survey Data (2018 to 2020): https://github.com/cncf/surveys/tree/master/cloudnative

CNCF 2020 Survey Report: https://www.cncf.io/wp-content/uploads/2020/11/CNCF_Survey_Report_2020.pdf

CNCF 2019 Survey Report: https://www.cncf.io/wp-content/uploads/2020/08/CNCF_Survey_Report.pdf

CNCF 2018 Survey Report: https://www.cncf.io/blog/2018/08/29/cncf-survey-use-of-cloud-native-technologies-in-production-has-grown-over-200-percent/

My workings: https://docs.google.com/spreadsheets/d/1qGxMgj-Kmlb-sTDcMZv2nHYpwzSKFH3JJB3rnZJqmPM/edit?usp=sharing