Data science is growing up. As more businesses look to harness the power of big data and advanced analytics, it’s time to start integrating development and data best practices. That’s been an elusive goal for many data science organizations, and it costs them in the long run. Everything from data breaches to rogue projects to model failures can be traced back to a lack of best practices. Here are a few tried-and-true best practices that it’s time to add to your data science department.
MDM & Data Governance
Data governance needs to cover the full lifecycle of data: collection, analysis, storage and disposal. A growing number of compliance issues, privacy and ethical concerns, and costs put data governance at the top of my data science best practices list.
Let’s start with compliance. In the US there are over 20 separate laws that deal with the handling of data. There’s no comprehensive standard, which means that some of the data a business collects is held to higher standards of protection than other data. Data collected needs to have a stated purpose. In Europe there are strict standards for data collection and use. Citizens have broad protections, and the EU has been very active in pursuing data privacy cases against even the largest of companies. Without a compliance strategy, a business is asking for trouble.
The costs associated with the collection, storage and processing of large datasets aren’t trivial. It’s an area where companies can look for significant cost savings. Being smart about data logistics is all it takes to capture those savings.
I think the hottest job segment within data science will be the data science product manager. Productizing data science initiatives is starting to yield big money for many companies. Data itself is being compared to currency and is a potential revenue stream. Especially in healthcare and finance, companies don’t really understand how valuable their data is. The data science capabilities they’re building internally have significant value as a service to other businesses.
That’s why I believe product management needs to be brought into data science initiatives. Companies are missing out on revenue, and all they need to capitalize is someone to say, “We can monetize that!” The monetization strategies around data science aren’t traditional, so it isn’t a simple matter of bringing a software product manager in. The role is a specialized hybrid, a unicorn among unicorns: someone who knows enough about data science to understand the projects and understands the market well enough to identify and monetize opportunities. It’s worth the effort to find a good data science product manager because the ROI is so high.
Companies are starting to hire data quality engineers and quality data scientists. The cost of defects in a data science system can be much worse than the cost of defects in traditional software. That’s because the business makes critical decisions based on the system’s insights, and those decisions can impact revenue for years.
Software testing in many data science teams is currently handled by the data scientists responsible for writing the code and selecting the algorithms. Anytime the fox guards the henhouse there’s going to be trouble. A dedicated quality engineer avoids that conflict and frees the data scientist to work on development full time.
This is another segment of data science that I think will take off this year and next. Machine learning algorithms are tested using a variety of methods. Those tests look at accuracy and optimization, and they check for pitfalls like over-fitting. Pairing engineers who specialize in model optimization and quality with those who can select, design and build models from scratch will save time while producing a better model.
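As a rough illustration of the kind of check a quality engineer might run, here is a minimal sketch that compares a model’s accuracy on its training data against its accuracy on held-out data and flags a large gap as possible over-fitting. The names `accuracy`, `check_fit` and `FitReport`, and the `max_gap` threshold of 0.10, are assumptions made up for this example, not a standard or anyone’s production method.

```python
from dataclasses import dataclass


@dataclass
class FitReport:
    train_accuracy: float
    holdout_accuracy: float
    gap: float
    overfit: bool


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def check_fit(train_pred, train_true, holdout_pred, holdout_true, max_gap=0.10):
    """Compare training and holdout accuracy.

    A model that scores much better on data it has seen than on data it has
    not seen may be over-fitting. max_gap is an illustrative threshold only.
    """
    train_acc = accuracy(train_pred, train_true)
    holdout_acc = accuracy(holdout_pred, holdout_true)
    gap = train_acc - holdout_acc
    return FitReport(train_acc, holdout_acc, gap, gap > max_gap)


# Hypothetical predictions: perfect on training data, half right on holdout.
report = check_fit([1, 0, 1, 1], [1, 0, 1, 1], [1, 0, 0, 1], [1, 1, 1, 1])
print(report)
```

In practice a team would likely use an established library and cross-validation rather than a single holdout split; the point here is only that the check can be automated and owned by someone other than the model’s author.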
A lot of businesses group security in with data governance, but it doesn’t belong there. It’s part of the system design and needs to be in the hands of security engineers. Data governance should have an oversight role (Does the security of the system meet company requirements? What do we do when there’s a breach?). However, the data science team needs an information security engineer. The system needs to be architected with security in mind. Look at Target or any of a number of large companies that have dealt with a large-scale data breach. Security should be an imperative for any data science team.
Data science isn’t a “wild west” technology anymore. It’s really grown up and matured over the last three years. As businesses build or ramp up data science teams, it’s a good time to think about how to build in the fundamental best practices of data science. Process and innovation are in a constant struggle, so striking a balance between the overhead of process and the pace of progress is critical.
Vineet Vashishta is the founder of V-Squared Consulting, a leading-edge data science services company. He has spent the last 20 years in retail/eComm, gaming, hospitality, and finance building the teams, infrastructure and capabilities behind some of the most advanced analytics companies in the US.
You can follow him on Twitter: @V_Vashishta.