For companies that are just starting to build Data Science capabilities, it is important to keep a few crucial points in mind. Clean data, applicable models, and business intuition are all key to success. Do not remove any of them from the equation. Data Science is essentially about identifying and/or creating the cleanest possible data set, and then searching mathematically for patterns within it. The goal should be to help business users make important data-driven decisions, prove or disprove their intuitions, predict the future, or optimize outcomes and processes. The following is an introduction to what it takes to establish an Analytics Center of Excellence in a systematic way.
Brainstorm & Prioritize
There is a clear difference in the approach followed by Data Scientists and business users, so it is vital that everyone is on the same page before embarking on a first project. Start by convening a stakeholder discussion with as many departments as possible, and brainstorm on problems that can be solved. For instance, if you are working with customer data, think about things such as analyzing patterns of good and bad customers to predict churn or identify opportunities for cross-selling and upselling. If you are working with supply chain data, think about things such as demand forecasting or inventory optimization.
Once opportunities have been identified, it is important to prioritize them by complexity, time to solve, and ROI. Though a complex problem may seemingly have more value, it is often best to start with low-hanging fruit. It is easy to lose business support when it takes several months to deliver a solution, so it is important to split projects into phases, each with demonstrable value. This is especially important for companies starting out in analytics, as they may not have the technical capability or internal support to invest significant money and time.
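As a rough illustration of the prioritization step, the sketch below ranks a few opportunities by expected return per unit of effort. The scoring formula, the 1-to-5 complexity scale, and all of the numbers are assumptions made up for this example; a real ranking should use whatever weighting your stakeholders agree on.

```python
from dataclasses import dataclass


@dataclass
class Opportunity:
    name: str
    complexity: int         # hypothetical scale: 1 (simple) to 5 (very complex)
    months_to_solve: float  # estimated time to a first deliverable
    expected_roi: float     # estimated return, in any consistent unit


def priority_score(opp: Opportunity) -> float:
    """Assumed formula: expected return divided by (complexity x time)."""
    return opp.expected_roi / (opp.complexity * opp.months_to_solve)


def prioritize(opportunities):
    """Return the opportunities sorted from highest to lowest score."""
    return sorted(opportunities, key=priority_score, reverse=True)


# Illustrative, made-up inputs.
backlog = [
    Opportunity("churn prediction", complexity=2, months_to_solve=2, expected_roi=200),
    Opportunity("inventory optimization", complexity=5, months_to_solve=8, expected_roi=900),
    Opportunity("cross-sell scoring", complexity=3, months_to_solve=3, expected_roi=270),
]

ranked = prioritize(backlog)
for opp in ranked:
    print(f"{opp.name}: {priority_score(opp):.1f}")
```

With these made-up inputs, the complex project has the largest expected return but ranks last once effort is taken into account, which is the low-hanging-fruit argument in numeric form.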
Agree on a Scope
Once there is a firm understanding of the business problem and the desired outcome of the project, it is important to agree on an initial scope. Determine how the business will consume the output, including a precise business definition of the desired results and how they should be reported. Also, consider how the project will be implemented in your existing workflow. Is the output going to be a dashboard, simple visualizations, or a score? Is it going to be a standalone application, or will it be integrated into existing infrastructure?
Infrastructure & Tools
There are many canned products for analytics; however, there has been an ongoing shift towards open source solutions. Relevant policies and governance are now well established, so this should no longer be a business concern. The benefits of this approach include no licensing fees, access to more algorithms, and access to the latest packages being developed by researchers and professors. Actively encourage your teams to explore open source options such as R and Python, which can provide as much functionality as out-of-the-box tools, if not more.
Assemble a Team
When assembling a team in-house, or while working with a partner, figure out what skill sets everyone brings to the table.
An ideal Data Scientist has a combination of programming skills, math and stats knowledge, and some domain expertise, but finding someone who fits all three categories is almost impossible. Most Data Scientists have a background in Computer Science, Engineering, Math, or Statistics, so communicating results to a business audience may not be their strongest skill. Thus, it is typically advisable to also include a business user who can logically explain the results. A solid analytics team should include at least three members: a programmer, a mathematician or statistician, and a domain expert (who can creatively tell stories). It is also good to have a senior Data Scientist for particularly tricky problems or during the architecting phase.
Cleanliness is King
Complexity and ROI are both important factors for moving forward with an analytics project, but the most important component is data quality. Recognize that 50-60% of the time will go into cleaning data, and input from business users is very important during this process. A great model built on bad data will almost always lose to an average model built on great data.
Data cleaning and munging are vital steps in any analytics project. During this stage, patience is key, and you should trust your team. Just because a data set needs to be “cleaned” does not mean that it is of poor quality. Even the most rigorously maintained data sets will need to undergo pre-processing before a model can be built on top. This process involves checking the quality of data, understanding it through descriptive statistics, and ascertaining the need and use of additional engineered features and external sources.
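A first pass at checking data quality through descriptive statistics might look like the sketch below. It uses only the Python standard library and assumes that the table is a list of dictionaries, that the columns are numeric, and that missing values are stored as None. The function names and the sample records are hypothetical.

```python
import statistics


def profile_column(values):
    """Summarize one numeric column: missing count plus basic statistics.

    Assumption: missing entries are represented by None.
    """
    present = [v for v in values if v is not None]
    summary = {"count": len(values), "missing": len(values) - len(present)}
    if present:
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["mean"] = statistics.mean(present)
        summary["median"] = statistics.median(present)
    return summary


def profile_table(rows):
    """Profile every column of a table given as a list of dicts."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Made-up customer records, for illustration only.
customers = [
    {"age": 34, "monthly_spend": 120.0},
    {"age": None, "monthly_spend": 80.0},
    {"age": 51, "monthly_spend": None},
    {"age": 29, "monthly_spend": 100.0},
]

report = profile_table(customers)
for column, summary in report.items():
    print(column, summary)
```

A report like this is a useful thing to review with business users, who can often say whether a missing value or an extreme one is an error or a real event.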
When collecting new data, it is important to focus on reliable recording, governance, and storage in an easily accessible centralized location.
Model Building & Validation
Most analytics problems can be broken down into classification, clustering, forecasting, pattern recognition, or optimization. Once a problem is categorized, it is time to think about the types of models that are best suited. For example, if you are looking to implement a predictive analytics use case and require explainable patterns, it is best to stick with interpretable models such as Logistic Regression or Decision Trees. If explainability is not important and you only need a high-accuracy black box prediction, consider more complex techniques such as Random Forests or Deep Learning.
When building a model, it is important to be able to test and validate it. For any model that predicts a result, such as a classifier, this involves splitting your initial data set into training and testing data. Train your model on the training portion and test it on held-out data (usually 20-30% of the data). This can be complemented by cross validation, which repeats the process on multiple training and test sets created by randomly selecting the data for each test set. Other models, such as time series models, require more care because of the ordered nature of the data. Sometimes the solution you are developing does not lend itself to easy validation. In such cases, it is the responsibility of the team to rigorously address these issues and determine alternative options.
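The splitting strategies described above can be sketched in plain Python as follows. This is a minimal illustration, not a recommended implementation: the function names are hypothetical, the 25% test fraction is just one value in the usual range, and k-fold is used here as one common form of cross validation. In practice, most teams would rely on the equivalent utilities in an established R or Python library.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_splits(rows, k=5, seed=0):
    """Yield (train, test) pairs; each row lands in exactly one test fold."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


def time_ordered_split(rows, test_fraction=0.25):
    """Keep chronological order: train on the past, test on the latest rows."""
    n_test = int(round(len(rows) * test_fraction))
    cut = len(rows) - n_test
    return list(rows[:cut]), list(rows[cut:])


data = list(range(20))  # stand-in for 20 observations, oldest first
train, test = train_test_split(data)
folds = list(k_fold_splits(data, k=5))
past, recent = time_ordered_split(data)
print(len(train), len(test), len(folds), recent)
```

The time-ordered split never shuffles, so the model is always evaluated on observations that come after everything it was trained on. That is the extra care ordered data requires.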
Solution Delivery
Once a fully vetted solution has been developed, it is important to report the results and insights gained from the project. This should include necessary documentation including assumptions behind the models and modelling process as well as practical guidelines on how to run and manage the model.
If the output of the solution is intended to be used by non-technical folks, training to ensure the solution is fully functional and usable should also be undertaken. This process is as important as the development itself. If the end user cannot use the model, then it does not matter how good it is. To ensure that this does not happen, the end user should be included in the process of formulating the usability of the model, and it should be tailored to their needs.
Building an Analytics Center of Excellence
Soothsayer can work with your business and technical teams to provide in-depth training and a blueprint for short, medium, and long-term analytics success. Based on brainstorming sessions with your internal stakeholders, Soothsayer will develop a custom and comprehensive blueprint of the analytics problems that can be solved with maximum impact; the recommended prioritization of the problems; and architecture for tools, data management and governance to ensure that data-driven decision making can be carried out efficiently. We will also provide recommendations on tools to be utilized, processes to be followed while building analytics solutions, and guidelines on how to assemble effective teams for executing these solutions. We also provide information specific to the identified problems including problem definition; relevant data sources; high-level solution architectures; the format and nature of the output from an IT-perspective; and the processes to be followed to ensure that business users are able to consume the results.
Soothsayer will also identify required technical talent, train them in advanced problem-specific methods, and ensure that business users possess the skills to actively guide and measure their success. We can also audit your existing data sources, identify important data that is not currently being stored, and recommend relevant external sources. We can help you understand where data comes from, its interconnectedness and importance, and provide suggestions for optimal storage (data warehouses, lakes, cloud, etc.).
Summary
Soothsayer Analytics has cross-industry experience and expertise in best practices. All of our Data Scientists hold a PhD or Masters and come from varied backgrounds such as Rocket Science, Physics, Computer Science, Statistics, and Engineering. This multifarious assemblage of talent allows us to successfully solve any data science problem and work with any data source.
We are well-versed in state-of-the-art Artificial Intelligence techniques but realize that most compan
If you would like to explore how Soothsayer can help your company become more data driven, visit us at www.SoothsayerAnalytics.com, call us at 1-844-44-SOOTH, or e-mail us at info@soothsayeranalytics.com