DATA SCIENCE

We help customers identify data science opportunities within their organisation, develop Machine Learning and Deep Learning solutions to validate our customers' business cases, and support them in bringing these solutions to life. We are inquisitive by nature, and we love exploring solutions using state-of-the-art technology and algorithms. Ultimately, our goal is to deliver great Data Science solutions that meet customer needs while staying true to our scientific and analytical methodology.

Typically, we deliver four key project types.

OUR APPROACH

For each use case, Goaltech proposes a four-stage approach.

Our discovery and productionisation work is underpinned by data-driven decisions. Thinking analytically and following Data Science best practices is in our DNA. When it comes to answering our customers’ questions, we never take things for granted. We are the first to admit “we don’t know what we don’t know”, and we allow the data to guide us towards the right answers.

Our ACTIVE principles underpin our work

Our data science team’s general capabilities

  • We all write code – good Python dev skills
  • We all have advanced DS / AI skills
  • We are all AWS and Azure certified

There are also areas of specialist expertise across the team:

  • Computer Vision
  • Natural Language Processing (NLP)
  • Time series and forecasting
  • Anomaly detection

DATA SCIENCE LIFECYCLE

Any project starts with either a well-defined problem statement (such as forecasting next month’s sales of a given item in the inventory, or identifying the causes of customer churn) or a loosely defined one (such as how to increase sales of a product).

Data science enables us to solve such business problems through a series of well-defined steps. Below are the steps we generally follow, and the terminology associated with each one falls under a different step. Let’s look at these steps.

Step 1: Business understanding

The business need is the starting point of the life cycle. It is therefore important to understand the problem statement and to ask the customer the right questions, so that we understand the data well and can derive meaningful insights from it.

We have the technology to make our lives easier, but the success of any project still depends on the quality of the questions asked of the dataset.

Every domain and business works with its own set of rules and goals. To acquire the correct data, we must understand the business. Asking questions about the dataset helps narrow the search down to the correct data to acquire.

We typically use data science to answer five types of questions:

  • How much or how many? (regression)
  • Which category? (classification)
  • Which group? (clustering)
  • Is this unusual? (anomaly detection)
  • Which option should be taken? (recommendation)

In this stage, you should also be identifying the central objective of your project by identifying the variables that need to be predicted.

A few of the right questions that successful businesses have asked of their data science teams in the past:

  • What percentage of time do drivers actually drive? How steady is their income?
  • What is the average occupancy of mediocre hotels?
  • What are the per-square-foot profits of our warehouses?

All these questions are a necessary first step before we can embark on a data science journey. Having asked the correct question, we move on to collecting data.

Step 2: Collecting data

The first step in the lifecycle of a data science project is to identify who knows what data to acquire, and when, based on the question to be answered. This person need not be a data scientist; anyone who understands the real differences between the available datasets and can make hard decisions about the organisation’s data investment strategy is the right person for the job.

Data might need to be collected from multiple types of data sources.

A few examples of data sources:

  • File-format data (spreadsheets, CSV, text files, XML, JSON)
  • Relational databases
  • Non-relational databases (NoSQL)
  • Website data collected with scraping tools
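As a minimal sketch of working with two of these file-format sources, the snippet below reads CSV and JSON with the Python standard library alone and normalises both into one list of records (in practice a library such as pandas would usually do this; the sample data here is hypothetical):

```python
import csv
import io
import json

# Hypothetical in-memory samples standing in for real files.
csv_text = "customer_id,amount\n1,19.99\n2,5.50\n"
json_text = '[{"customer_id": 3, "amount": 12.0}]'

# CSV: each row becomes a dict keyed by the header line.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: parsed directly into Python lists/dicts.
json_rows = json.loads(json_text)

# Normalise both sources into a single list of typed records.
records = [
    {"customer_id": int(r["customer_id"]), "amount": float(r["amount"])}
    for r in csv_rows
] + json_rows

print(records)
```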

Our first piece of terminology, BIG DATA, fits here. Big data is simply any data that is too big or too complex to handle with conventional tools; it does not necessarily mean data that is large in size alone. Big data is characterised by four properties, and data that exhibits them qualifies as big data. These properties are the 4 V’s.

  • Volume: Data in terabytes
  • Velocity: Streaming data with high throughput
  • Variety: Structured, semi-structured, and unstructured
  • Veracity: the quality of the data being analysed

In a retail business, many customers make transactions every second, and a lot of data is maintained in structured or unstructured formats about customers, employees, stores, sales, and so on. All this data put together is overly complex to process or even comprehend. Big data technologies such as Hadoop, Spark, and Kafka simplify our work here.

Step 3: Cleaning data

This step is often referred to as the data wrangling phase. Data scientists often complain that it is the most tedious and time-consuming task, involving the identification of various data quality issues.

In this step, we understand more about the data and prepare it for further analysis. The data understanding section of the data science methodology answers the question: Is the data that you collected representative of the problem to be solved?

Cleaning data essentially means removing discrepancies from your data such as missing fields, improper values, setting the right format of the data, structuring data from raw files, etc.

Format the data into the desired structure and remove unwanted columns and features. Data preparation is the most time-consuming yet arguably the most important step in the entire life cycle: your model will only be as good as your data. This is like washing vegetables to remove the surface chemicals. Data collection, data understanding, and data preparation typically take up 70–90% of the overall project time.

This is also the point where, if you feel the data is not suitable or sufficient to proceed, you go back to the data collection step.
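A minimal sketch of this kind of cleaning on a hypothetical toy dataset: drop records with missing fields, coerce types, and standardise an inconsistent date format (real pipelines would also consider imputation instead of dropping):

```python
from datetime import datetime

raw = [
    {"name": "Alice", "age": "34", "joined": "2021-03-05"},
    {"name": "Bob",   "age": "",   "joined": "2020-11-12"},   # missing age
    {"name": "Cara",  "age": "29", "joined": "12/01/2022"},   # wrong date format
]

def clean(record):
    """Return a typed, consistent record, or None if it cannot be repaired."""
    if not record["age"]:
        return None  # missing field: drop it (imputation is another option)
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            date = datetime.strptime(record["joined"], fmt).date()
            break
        except ValueError:
            continue
    else:
        return None  # unrecognised date format
    return {"name": record["name"], "age": int(record["age"]),
            "joined": date.isoformat()}

cleaned = [c for c in (clean(r) for r in raw) if c is not None]
print(cleaned)  # Bob is dropped; Cara's date is normalised to ISO format
```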

Step 4: Analysing data

Exploratory analysis is often described as a philosophy, and there are no fixed rules for how you approach it. There are no shortcuts for data exploration.

Remember the quality of your inputs decides the quality of your output. Therefore, once you have got your business hypothesis ready, it makes sense to spend a lot of time and effort here.

To understand the data, many people look at summary statistics such as the mean and median. They also plot the data and inspect its distribution through histograms, spectrum analysis, population distributions, and so on.
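The summary statistics mentioned above can be sketched with the standard-library `statistics` module on hypothetical daily sales figures:

```python
import statistics

daily_sales = [120, 135, 118, 240, 131, 127, 125]  # 240 looks like an outlier

mean = statistics.mean(daily_sales)
median = statistics.median(daily_sales)
stdev = statistics.stdev(daily_sales)

# A large gap between mean and median hints at skew or outliers,
# which a histogram would confirm visually.
print(mean, median, round(stdev, 1))
```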

Now we create a plan for the analytics to perform on the data. Different types of analytics can be applied depending on the problem at hand:

  • Descriptive analytics (what happened in the past?): data aggregation methods and tools provide insight into what has already happened.
  • Diagnostic analytics (why did something happen?): a deep-dive, detailed examination of the data to understand why something happened, characterised by techniques such as drill-down, data discovery, data mining, and correlation analysis. Multiple data operations and transformations may be performed on a given dataset to discover unique patterns with each of these techniques.
  • Predictive analytics (what could happen in the future?): statistical methods and other forecasting techniques, including data mining and machine learning, estimate what could happen in the future.
  • Prescriptive analytics (what should we do?): optimisation and simulation methods support decision-making and describe the possible outcomes of what-if analysis.

This step of the lifecycle does not by itself produce the final insights. However, through exploration and repeated data cleaning, data scientists can identify weaknesses in the data acquisition process, decide what assumptions to make, and choose what models to apply to produce analysis results.

So we first determine which type of analytics we intend to perform; this is the data analytics part. After obtaining structured data from the cleaning operations (which is generally the case), we perform data mining to identify and discover hidden patterns and information in a large dataset.

For example, identifying seasonality in sales. Data analysis is the more holistic approach, while data mining focuses on finding the hidden patterns in the data. The discovered patterns are then fed into the data analysis approaches to generate hypotheses and find insights.
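The seasonality example can be sketched very simply: average sales per calendar month across two hypothetical years of data to expose a repeating pattern (real work would use proper time series decomposition):

```python
from collections import defaultdict

# (year, month, sales) triples; December is consistently high.
sales = [
    (2022, 1, 100), (2022, 6, 110), (2022, 12, 300),
    (2023, 1, 105), (2023, 6, 115), (2023, 12, 320),
]

# Group sales amounts by month regardless of year.
by_month = defaultdict(list)
for _, month, amount in sales:
    by_month[month].append(amount)

# Average per month; the peak month reveals the seasonal pattern.
monthly_avg = {m: sum(v) / len(v) for m, v in by_month.items()}
peak_month = max(monthly_avg, key=monthly_avg.get)
print(peak_month, monthly_avg[peak_month])  # December dominates
```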

Step 5: Data Modelling / Machine Learning modelling

This stage seems to be the most interesting one for almost all data scientists. Many people call it “the stage where the magic happens”. But remember, magic can happen only if you have the correct props and technique. In data science, the data is the prop and data preparation is the technique, so make sure to spend sufficient time on the prior steps before jumping to this one.

Modelling is used to find patterns or behaviours in the data. These patterns help us in one of two ways:

  • Descriptive modelling (Unsupervised learning)
  • Predictive modelling (Supervised Learning)

Supervised Learning

Supervised learning is a technique in which we teach or train the machine using well-labelled data.

A few examples of Supervised Algorithms:

  • Naive Bayes
  • Random Forest
  • Neural Network Algorithms
  • k-Nearest Neighbour (kNN)
  • Linear Regression
  • Logistic Regression
  • Support Vector Machines (SVM)
  • Decision Trees
  • Boosting
  • Bagging
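As a minimal sketch of supervised learning at its simplest, the snippet below fits the linear regression model y = a·x + b by ordinary least squares on labelled toy data, without any external library:

```python
# Hypothetical labelled data: ys is roughly 2 * xs.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares: slope = cov(x, y) / var(x).
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict(x):
    """Predict a label for an unseen input using the fitted line."""
    return slope * x + intercept

print(round(slope, 2), round(intercept, 2))
```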

Unsupervised Learning

Unsupervised learning involves training by using unlabelled data and allowing the model to act on that information without guidance. Think of unsupervised learning as a smart kid that learns without any guidance.

A few examples of Unsupervised Algorithms:

  • PCA
  • k-means / k-means++
  • Hierarchical Clustering
  • DBSCAN
  • Market Basket Analysis
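A minimal sketch of unsupervised learning: a 1-D k-means (Lloyd’s algorithm) that groups unlabelled values into two clusters with no guidance beyond the data itself (the min/max initialisation below is a simplification that only suits k = 2):

```python
def kmeans_1d(values, k=2, iters=20):
    """Cluster 1-D values with Lloyd's algorithm; returns (centroids, clusters)."""
    centroids = [min(values), max(values)]  # simple initialisation for k=2
    for _ in range(iters):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

values = [1.0, 1.2, 0.8, 9.8, 10.1, 10.4]
centroids, clusters = kmeans_1d(values)
print(sorted(round(c, 1) for c in centroids))  # two clear group centres
```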

Below are some of the standard practices involved in understanding, cleaning, and preparing your data for building your predictive model:

  • Variable Identification
  • Univariate Analysis
  • Bi-variate Analysis
  • Missing values treatment
  • Outlier treatment
  • Variable transformation
  • Variable creation
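One practice from the list above, outlier treatment, can be sketched with the common interquartile-range (IQR) rule on hypothetical data:

```python
import statistics

def iqr_bounds(data):
    """Return (low, high) fences using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [10, 12, 11, 13, 12, 95, 11, 10]  # 95 is suspicious
low, high = iqr_bounds(data)

# Remove values outside the fences (capping them is a gentler alternative).
kept = [x for x in data if low <= x <= high]
print(kept)
```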

Step 6: Model Evaluation

A common question professionals face when evaluating a machine learning model is which dataset to use to measure its performance. Looking at performance metrics on the training dataset is helpful but can be misleading: the numbers may be overly optimistic because the model has already adapted to the training data. Model performance should instead be measured and compared using validation and test sets, to identify the best model in terms of accuracy and over-fitting.

The model is selected based on the business problem. It is essential to identify the task: is it classification, regression, time series forecasting, or clustering? Once the problem type is sorted out, the model can be implemented.
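The train/validation/test split described above can be sketched as follows (a hypothetical 100-example dataset and a 60/20/20 split; real work would stratify labelled data):

```python
import random

data = list(range(100))  # hypothetical 100 labelled examples
random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(data)     # shuffle once before carving up the data

train = data[:60]
validation = data[60:80]
test = data[80:]

# Fit on `train`, tune on `validation`, and report on `test` only once.
print(len(train), len(validation), len(test))
```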

A few examples of Classification metrics:

  • Classification Accuracy
  • Confusion matrix
  • Logarithmic Loss (Log Loss)
  • Area under curve (AUC)
  • F-Measure (F1 Score)
  • Precision
  • Recall
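Several of the listed classification metrics follow directly from the confusion-matrix counts, sketched here on hypothetical binary labels without external libraries:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))
```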

A few examples of Regression metrics:

  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
  • Mean Absolute Percentage Error (MAPE)
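The regression metrics above can be computed directly from actual versus predicted values, sketched here on hypothetical numbers:

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, and MAPE from paired actual/predicted values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e ** 2 for e in errors) / len(errors)
    rmse = math.sqrt(mse)
    # MAPE assumes no true value is zero.
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / len(errors) * 100
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}

y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]
print(regression_metrics(y_true, y_pred))
```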

The model should be robust, not overfitted: an overfitted model will not predict future data accurately.

Step 7: Driving insights and BI reports

In this step, technical skills alone are not sufficient. One essential skill is the ability to tell a clear and actionable story. If your presentation does not trigger action in your audience, your communication was not effective. The presentation should be in line with the business questions and meaningful to the organisation and its stakeholders. Remember that you may be presenting to an audience with no technical background, so the way you communicate the message is key.

A few tools used for visualisation:

  • Tableau
  • Power BI
  • R – ggplot2, lattice
  • Kibana
  • Grafana
  • Spotfire
  • Python – Matplotlib, Seaborn, Plotly

Step 8: Model Deployment

After models are built, they are first deployed in a pre-production or test environment before being promoted to production.

Whatever shape or form your model is deployed in, it must be exposed to the real world. Once real people use it, you are bound to get feedback, and capturing this feedback can make or break the project.

A few frameworks used for model deployment:

  • Flask
  • Django
  • FastAPI

Popular and widely used cloud providers:

  • AWS
  • Azure
  • Google Cloud

Step 9: Taking actions

Actionable insights from the model show how Data Science delivers both predictive and prescriptive analytics, giving us the power to learn how to repeat positive results and how to prevent negative ones.

Based on all the insights gathered through observation of the data or through the machine learning model’s results, we reach a state where we can make decisions about the business problem at hand.

A few examples:

  • How much stock of item X do we need to have in inventory? How much discount should be given to item X to boost its sales and maintain the trade-off between discount and profit?
  • How much attrition is predicted and what can be done to avoid the same?

Each step has its own importance, and projects iterate back and forth between them. Multiple people from different technical backgrounds work in coordination to produce a successful deliverable.