Analyzing the Last 4 Years of StackOverflow Survey Data
Introduction
In this post I’ll give an overview of an end-to-end data analytics project I recently did, analyzing the last 4 years of StackOverflow survey data. Here’s the GitHub link to the code and the Tableau dashboard so that you can analyze the data yourself.
Note: If you have Tableau Desktop, you can download the .twbx file and perform your analysis locally.
For data collection, we’ll take the last 4 years of combined data from Kaggle and build our dataset. The dataset is fairly large, with more than 300k survey responses (rows) and 15 questions (columns). We load it in CSV format and start cleaning and preprocessing it.
Once we’re done with that, we’ll perform EDA and statistical analysis on the data using Tableau and Python respectively. Finally, we’ll summarize everything in a Story so that it can be showcased in a business setting.
This project can be really helpful for both freshers and working professionals, as they can replicate it or add their own flavor to it and put it on their resume/CV.
I will NOT be showing or explaining the code here, but I will share the insights that I got out of the data. Having said that, I’ve tried my best to document the code as much as possible so that no one has an issue understanding it. I’ll also be happy to answer your questions in the comments section, or you can connect and ping me on LinkedIn :)
Prerequisites
This project requires knowledge of —
- Python (pandas, numpy etc. the whole data analysis stack)
- Intermediate Statistics
- Tableau Public
Data Preparation
You can find all the code here.
As we had combined (unioned) the last 4 years of StackOverflow survey data, there were a lot of inconsistencies in it. For example:
- The Hobby column is either ‘Yes’ or ‘No’ for all years except 2017, where the options were ‘Yes / No / Yes, I program as a hobby / Yes, I contribute to open source projects / both’, so these values had to be mapped to a consistent format.
- Similar changes were made in almost all the columns, as the format of the survey answers changed from year to year.
- Some columns contained multiple values in each row/record. For example, the ‘DatabaseWorkedWith’ column listed all the databases that a respondent was currently working with, so we used the ‘Counter’ class from Python’s ‘collections’ module to store the frequency of each database separately as key-value pairs for every year.
- There were not many missing values except in Salary, where 50% of the values were missing, making it unfit for analysis as it was. The distribution of salary was also right-skewed, so we applied ‘group median imputation’ to that column, grouping by year.
In this method we perform group-based median imputation, filling each missing value with the median of its group instead of a single overall median. Grouping was done by year because salary trends change a lot over time (e.g. the Covid situation impacted salaries all over the world).
- All data was finally exported to a CSV for exploratory data analysis and visualization.
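Here is a minimal sketch of how the cleaning steps above could look in pandas. It is not the project code: the toy DataFrame, the example answers, the ‘;’ separator and the helper name `count_values` are all assumptions made for illustration, and mapping every answer that starts with ‘Yes’ to ‘Yes’ is just one possible choice.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Toy data standing in for the combined survey (values are made up).
df = pd.DataFrame({
    "Year": [2017, 2017, 2018, 2018, 2018, 2020],
    "Hobby": ["Yes, I program as a hobby", "No", "Yes", "No", "Yes", "Yes"],
    "DatabaseWorkedWith": ["MySQL; PostgreSQL", "MySQL", "MongoDB;MySQL",
                           None, "PostgreSQL", "MySQL"],
    "Salary": [50000.0, np.nan, 80000.0, 100000.0, np.nan, 70000.0],
})

# 1. Collapse the verbose 2017 answers into plain Yes/No
#    (assumption: anything starting with "Yes" counts as "Yes").
df["Hobby"] = df["Hobby"].str.startswith("Yes").map({True: "Yes", False: "No"})


def count_values(series, sep=";"):
    """Count each item of a multi-valued column (hypothetical helper)."""
    counts = Counter()
    for cell in series.dropna():
        counts.update(item.strip() for item in cell.split(sep))
    return counts


# 2. One Counter of database frequencies per survey year.
db_counts = {
    year: count_values(group["DatabaseWorkedWith"])
    for year, group in df.groupby("Year")
}

# 3. Group median imputation: fill a missing salary with the median
#    salary of the same year instead of one overall median.
year_median = df.groupby("Year")["Salary"].transform("median")
df["Salary"] = df["Salary"].fillna(year_median)

print(db_counts[2017].most_common())
print(df[["Year", "Hobby", "Salary"]])
```

Using `transform("median")` keeps the result aligned with the original rows, so `fillna` can use it directly without a merge.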
Exploratory Data Analysis
For the EDA, we’ll ask questions of the survey data and answer them with visualizations. I highly encourage you to keep the Tableau workbook open so that you can interact with all the visuals/filters/parameters in real time and do your own analysis.
Employment types of Survey Respondents
We can see that around 70% of the respondents were employed full-time in 2020. A similar pattern can be seen in the other years as well.
The key thing to notice here is that students make up more than 12% of the respondents, which suggests that the Covid situation might have increased the participation of students on these e-learning platforms.
Do People Code as a Hobby? Any effect because of Type of Employment?
We can see that around 80% of survey respondents in almost every employment category code as a hobby. The largest percentage (90%) is among people who are ‘Not employed, and not looking for work’, which is intuitive, as people have to be really passionate about coding to keep doing it even if it doesn’t make them money.
Are more respondents from bigger organizations? Do they have higher/lower salaries compared to others?
Most of the people over the 4 years haven’t stated the size of their organization. This is odd, as more than 70% of the people were employed full-time (in all 4 years). More research should be done on why people shy away from disclosing this.
Also, the average salary at large organizations is significantly higher than at other organizations. We can see that organizations with more than 5k employees pay more than $100k annual C.T.C. Freelancers earn around $80k, which is a decent amount as well (on par with people working in small startups).
Does the type of undergraduate major have an effect on avg. salary?
Strangely, we see that the highest-earning degree is Fine arts/humanities rather than CS/Engineering, probably because these respondents are able to build on their existing skills and add coding to them. People who never declared a major have a higher mean salary than many who did pursue one, which suggests that skills matter the most.
The lowest mean salary belongs to those who didn’t mention the subject of their major.
Salary & People Participation in Survey(over the yrs)
We can see an increasing trend in average salary up to 2019, after which Covid-19 hit and the mean salary of IT professionals around the world dropped sharply to $80k (a decrease of almost $20k).
Coming to participation, we can see a huge spike in the number of respondents in 2018 (around 99k), after which it keeps falling, with the latest (2020) count being a little over 64k. We should dig a little more into what might have happened, for example:
- The survey was too long for people to complete.
- Maybe the English was a little too complex for some respondents, as the survey was filled in by people from all around the world, many of whom are not native English speakers.
- The questions were too personal for some people.
- The option to fill in the survey was not shown to everyone (e.g. some mandatory reputation level on StackOverflow).
Many other reasons could exist. The main aim should be to fix these issues to make these surveys successful which is only possible if more & more people participate.
What are the most common types of Developers in the Survey?
Full-stack, back-end and front-end developers make up most of the survey respondents, so StackOverflow is full of web developers. The counts of data scientists and data analysts are around 22k and 18k respectively.
There’s a Top N parameter on the left that you can use to show only the top n types of developers (based on count).
Most famous Languages over-all
The visual above shows the top 10 languages over a span of 4 years. (Again, you can change the Top N to top 5, top 3, etc. using the parameter at the top left.)
We can see that HTML/CSS tops the chart with the largest number of users, whereas JavaScript has the largest number of people who desire to use it.
Python and SQL are also doing well, with not only a lot of people already working with them but also a lot of people wanting to use them. Python specifically has more people who desire to learn it (111k) than people already using it (104k), probably because of the huge increase in demand for data science skills.
Most Famous Languages Year-wise analysis
If we look at the latest year (2020; you can control this as well using the filters at the top left), Python seems to be the most popular language of all, with over 6% more people wanting to learn it than already working with it. JavaScript remains the most widely used language in the industry, with a lot of demand as well.
Similarly, Go, Rust and Kotlin are also becoming very popular.
VBA seems to be decreasing in demand and also has very few existing users.
Most Famous Databases Year-wise analysis
Similarly, for databases we can see that MySQL has the largest number of users, probably because MySQL is widely used in web development and most of the survey respondents were web developers.
For the most recent year, 2020, we can see that there’s a lot of desire to learn PostgreSQL, probably because it is seen as a good fit for working with large datasets, which makes it attractive for data science and analytics.
(Note: PostgreSQL had an upward trend in the previous 3 years as well!)
Additionally, there’s an increase in the desire to learn NoSQL databases such as MongoDB and Elasticsearch, as over the last decade more businesses have started using unstructured data for increased agility, performance and scale.
Global Contribution to Stackoverflow community
In 2020, most of the developers are from the U.S.A. (12.5k) and India (8.5k), with average salaries of $155k and $46k respectively. Also, if you hover your mouse over a specific region, you can see the count of respondents/devs and the average salary for that country.
Moreover, if you click on a country, the Wikipedia page will open in a new tab showing info about that specific country.
Final Dashboard
This final dashboard summarizes our story containing a few visuals that we’ve already seen and some new ones as well.
- We can see that the highest average salary belongs to people who are very satisfied with their job. These 2 factors could be highly correlated, as people who earn more tend to be more satisfied. Reverse causality could also be present: people who are very satisfied with the type of work they do put more mind and heart into it, and hence are offered higher salaries (on average).
- Most of the respondents in the latest year hold only a bachelor’s degree.
- The highest average salary belongs to those who have a doctoral degree (such as a PhD).
- People who don’t have a formal education also earn a decent amount ($75k), which suggests that skills are the most important factor for thriving in tech.
The dashboard is interactive, so clicking on one visual will show filtered values in the others (except the line chart, which shows values for all years). I encourage you to play with it.
Statistical Analysis
We perform some statistical analysis and tests on the Salary column.
As we can see, the salaries are heavily right-skewed, and building something like a z-based confidence interval for the mean relies on an approximately normal sampling distribution.
So we use the bootstrapping method: we repeatedly resample the salaries with replacement and record the mean of each resample, thereby building a sampling distribution of sample means of Salary. The raw salaries stay skewed, but by the central limit theorem this distribution of means is approximately normal.
Note: If you want to get more technical, here is the link to the notebook with all the code. Everything has been well commented for you :)
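To make the bootstrapping idea concrete, here is a small sketch using only NumPy. The salaries are synthetic (drawn from a log-normal distribution to mimic right skew), and the helper names `bootstrap_means` and `skewness` as well as the numbers of resamples are my own assumptions for illustration; the resample size of 1000 simply mirrors the n=1000 mentioned in this section.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic right-skewed "salaries"; a stand-in for the real Salary column.
salaries = rng.lognormal(mean=11.0, sigma=0.8, size=5_000)


def bootstrap_means(values, n_resamples=2_000, sample_size=1_000, rng=rng):
    """Means of n_resamples samples drawn with replacement from values."""
    idx = rng.integers(0, len(values), size=(n_resamples, sample_size))
    return values[idx].mean(axis=1)


def skewness(x):
    """Sample skewness (third standardized moment)."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(((x - x.mean()) / x.std()) ** 3))


boot_means = bootstrap_means(salaries)

print(f"skewness of raw salaries:    {skewness(salaries):.2f}")
print(f"skewness of bootstrap means: {skewness(boot_means):.2f}")
print(f"mean of raw salaries:        {salaries.mean():,.0f}")
print(f"mean of bootstrap means:     {boot_means.mean():,.0f}")
```

The skewness of the bootstrap means should come out much closer to 0 than that of the raw values, while the two means stay close to each other.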
- Based on the survey data, we are 95% confident that the mean salary (C.T.C.) of IT professionals around the world lies between $79,500 and $80,100. (We used the Z-distribution as we had a large enough sample size of n=1000.)
- According to ‘BWPeople.in’, the mean salary of IT professionals in the world is $89,732.
- We took this value as the null hypothesis for our data. The result of our z-test was significant, so we reject the null hypothesis in favor of the alternative hypothesis, that is, the mean salary of IT professionals is lower than stated on the website.
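The two steps above, a z-based confidence interval and a one-sample z-test, can be sketched as follows. This is only an illustration on synthetic data, so the printed numbers will not match the ones reported here. The function names `z_interval` and `left_tailed_z_test` are hypothetical, 1.96 is the usual critical value for a 95% two-sided interval, and $89,732 is the reference value quoted above.

```python
import math

import numpy as np


def z_interval(values, z_crit=1.96):
    """Two-sided z-based confidence interval for the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std_err = values.std(ddof=1) / math.sqrt(len(values))
    return mean - z_crit * std_err, mean + z_crit * std_err


def left_tailed_z_test(values, mu0):
    """One-sample z-test of H0: mean = mu0 against H1: mean < mu0."""
    values = np.asarray(values, dtype=float)
    std_err = values.std(ddof=1) / math.sqrt(len(values))
    z = (values.mean() - mu0) / std_err
    # Standard normal CDF at z, written with the error function.
    p_value = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z, p_value


# Synthetic sample of n=1000 right-skewed salaries (not the survey data).
rng = np.random.default_rng(seed=0)
sample = rng.lognormal(mean=11.0, sigma=0.8, size=1_000)

low, high = z_interval(sample)
z, p_value = left_tailed_z_test(sample, mu0=89_732)

print(f"95% CI for the mean: {low:,.0f} to {high:,.0f}")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A p-value below the chosen significance level (commonly 0.05) would lead to rejecting the null hypothesis in favor of a lower mean.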
Pitfalls of our analysis
Although we’ve got a lot out of our data, we can’t blindly trust it. The main reasons are:
- We have combined responses from the last four years, so it’s quite possible that a respondent from one year also participated in the following year. This means that our samples might not be fully independent.
- The sample might not be a good representation of the overall IT population, as the survey was shown only to people who are active on StackOverflow and not on any other online platform. (Another good question to ask: were all StackOverflow users asked to fill in the survey, or only some of them? If only some, which section?)
- Survey weights were not used in the analysis.
- The imputation method we used for salary was a pretty basic single imputation method. Better options would be MICE (a multiple imputation method), regression/KNN imputation, hot deck imputation, etc.
Conclusion
So that was it for my second portfolio project, this time on a large survey dataset. I’ll write about more end-to-end projects in the coming days. This is a good project for freshers and for professionals looking to make a transition, as it covers the whole pipeline of a data analytics project (from taking large, dirty, inconsistent data to publishing the analytic reports).
Hope you liked the article! Please leave your valuable feedback or any questions that you have below.