“Prediction” is very important in analyzing business big data


Now is a good time to reconsider “Big data and digital economy”, because the LinkedIn group of this name now has a four-month history and more than 100 participants. I appreciate the cooperation of all of you.

In the early 2000s, I worked in the risk management department of a Japanese consumer finance company. The company had a credit risk model that could predict who was likely to default. I studied it in detail and came to understand why it worked so accurately. I found that if I collected a lot of data about customers, I could obtain accurate predictions of default for each customer.

Now, in 2015, I have researched many algorithms and statistical models, including state-of-the-art “deep learning”. While such models have many uses and objectives, in my view the most important one for business people is “prediction”, just as in my experience at the consumer finance company, because they have to make good business decisions to compete in their markets.

If you are in the health care industry, you may be interested in predicting who is likely to be cured. If you are in sales, you may be interested in predicting who is likely to come to the shop and buy the products. If you are in marketing, you may be interested in who is likely to click an advertisement on the web. Whatever you do, predictions are very important for your business because they enable you to take the right actions. Let me explain the key points about predictions.



What are you interested in predicting? The revenue of your business? The number of customers? The satisfaction rate based on client feedback? The price of wine in the near future? You can name anything you want. We call it the “Target”. So firstly, the “Target” should be defined so that you can make the right business decisions.



Secondly, let us find things related to your target. We call them “features”. For example, if you are a salesperson interested in who is likely to buy the products, features are “attributes of each customer such as age, sex and occupation”, “behavior of each customer, such as how many times he/she comes to the shop per month and when he/she last bought the products”, “what he/she clicked in the web shop” and so on. Based on the prediction, you can send coupons or tickets to “highly likely to buy” customers in order to increase your sales. If you are interested in the price of wine, features may be temperature, amount of rain, locations of farms and so on. If you can predict the price of wine, you might make good investments in wine. These are just simple examples. In reality, the number of features may be 100, 1,000 or more. It depends on the data you have. Usually, the more data you have, the more accurate your predictions are. This is why data is very important for obtaining predictions.
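As a minimal sketch of the idea above, the sales example could be laid out as one row per customer, with the features in some columns and the target in another. The field names and the records below are invented for illustration; they are not real data.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # Features: attributes and behavior of a customer (hypothetical fields).
    age: int
    visits_per_month: int
    days_since_last_purchase: int
    # Target: what we want to predict (1 = bought, 0 = did not buy).
    bought: int


# Made-up historical records, for illustration only.
history = [
    Customer(age=34, visits_per_month=5, days_since_last_purchase=7, bought=1),
    Customer(age=52, visits_per_month=1, days_since_last_purchase=90, bought=0),
    Customer(age=27, visits_per_month=3, days_since_last_purchase=14, bought=1),
]


def split_features_target(rows):
    """Separate the feature columns (X) from the target column (y)."""
    X = [[r.age, r.visits_per_month, r.days_since_last_purchase] for r in rows]
    y = [r.bought for r in rows]
    return X, y


X, y = split_features_target(history)
print(X)  # one list of feature values per customer
print(y)  # one target value per customer
```

A statistical model is then trained on X to reproduce y, and later applied to new customers whose target is not yet known.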


Evaluation of predictions

Finally, by inputting features into statistical models, predictions of the target can be obtained. You can then predict who is likely to buy the products when you plan marketing strategies, which makes those strategies more effective. Unfortunately, customer preferences may change in the long run. When situations and environments such as customer preferences change, predictions may not be accurate anymore. So it is important to evaluate predictions and update statistical models periodically. No model can work accurately forever.
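One simple way to evaluate predictions is to compare them with what really happened and to watch how the share of correct predictions changes over time. The sketch below assumes yes/no predictions (1 = bought, 0 = did not buy); the monthly numbers and the 0.8 threshold are made-up assumptions, not recommendations.

```python
def accuracy(predicted, actual):
    """Share of predictions that matched what really happened."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)


def needs_update(predicted, actual, min_accuracy=0.8):
    """True when accuracy falls below the (assumed) acceptable level."""
    return accuracy(predicted, actual) < min_accuracy


# Made-up results for two months.
january_pred, january_actual = [1, 0, 1, 1, 0], [1, 0, 1, 1, 0]
june_pred, june_actual = [1, 0, 1, 1, 0], [0, 0, 1, 0, 1]

print(accuracy(january_pred, january_actual))   # 1.0
print(needs_update(january_pred, january_actual))  # False
print(accuracy(june_pred, june_actual))         # 0.4
print(needs_update(june_pred, june_actual))     # True
```

In this invented example the model was right every time in January but only two times out of five in June, which would be a signal to retrain it on newer data.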


Once you can obtain predictions, you can build the prediction process into your daily activities, rather than treating it as a one-off analysis. This means that data-driven decisions are made on a daily basis. It is one of the biggest aspects of the “digital economy”. From retail shops to health care and the financial industry, predictions are already used in many fields. The methods of prediction are sometimes considered a “black box”, but I do not think it is good to use predictions without understanding the methods behind them. I would like to explain them in my weekly letter in the future. I hope you enjoy it!



Note: Toshifumi Kuga’s opinions and analyses are personal views and are intended to be for informational purposes and general interest only and should not be construed as individual investment advice or solicitation to buy, sell or hold any security or to adopt any investment strategy. The information in this article is rendered as at publication date and may change without notice and it is not intended as a complete analysis of every material fact regarding any country, region, market or investment.

Data from third-party sources may have been used in the preparation of this material and I, the author of the article, have not independently verified or validated such data. I accept no liability whatsoever for any loss arising from the use of this information, and reliance upon the comments, opinions and analyses in the material is at the sole discretion of the user.

“Classification” is significantly useful for our business, isn’t it?


Hello, I am Toshi. I hope you are doing well. I have been considering how we can apply data analysis to our daily business, so I would like to introduce “classification” to you.

If you are working in a marketing or sales department, you want to know who is likely to buy your products and services. If you are in legal services, you would like to know who will win a case in court. If you are in the financial industry, you would like to know which of your loan customers will default.

All these cases can be treated as the same kind of problem, called “classification”. It means that you can pick out the things or events you are interested in from the whole population you have on hand. If you have data about who bought your products and services in the past, you can apply “classification” to predict who is likely to buy and make better business decisions. Based on the results of classification, you can know who is likely to win cases and who will default, together with a numerical measure of certainty, which is called “probability”. Of course, “classification” cannot be a fortune teller. But “classification” can tell us who is likely to do something, or what is likely to occur, with some probability. If “classification” gives your customer a probability of 90%, it means that they are highly likely to buy your products and services.
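To illustrate how such probabilities can be turned into an action, here is a small sketch that picks the customers at or above a chosen probability. The customer names, the scores and the 0.9 threshold are all made-up assumptions; in practice the scores would come from a trained classification model.

```python
def likely_buyers(probabilities, threshold=0.9):
    """Return the ids of customers whose predicted probability of buying
    is at or above the threshold."""
    return [cid for cid, p in probabilities.items() if p >= threshold]


# Hypothetical model output: probability that each customer buys.
scores = {"customer_a": 0.93, "customer_b": 0.41, "customer_c": 0.90}

print(likely_buyers(scores))  # ['customer_a', 'customer_c']
```

Lowering the threshold selects more customers but includes less certain ones, so the right value depends on the business decision at hand.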


I would like to give several examples of “classification” for each business. You may want clues to the questions below.

  • For the sales/marketing personnel

Which movies or music will be in the Top 10 ranking in the future?

  • For personnel in the legal services

Who will win the case?

  • For personnel in the financial industries or accounting firms

Who will default in the future?

  • For personnel in healthcare industries

Who is likely to have a disease or to recover from it?

  • For personnel in asset management marketing

Who is rich enough to be a target for investment promotion?

  • For personnel in sports industries

Which team will win the World Series in baseball?

  • For engineers

Why did the spaceship engine explode in the air?


We can think of many more examples as long as data is available. To solve the problems above, we need past data that includes the target variable, such as who bought products, who won cases and who defaulted. Without past data, we can predict nothing. So data is critically important for “classification” to make better business decisions. I think data is “King”.


Technically, several methods are used in classification: logistic regression, decision trees, support vector machines, neural networks and so on. I recommend learning logistic regression first, as it is simple, easy to apply to real problems, and provides basic knowledge for learning more complex methods such as neural networks.
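To give a feeling for how simple logistic regression is, here is a minimal sketch with a single feature, fitted by plain gradient descent. The data (shop visits per month and whether the customer bought) is invented, and the learning rate and number of iterations are arbitrary choices of mine; a real analysis would normally use an established statistics package instead.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent
    on the average log loss. Returns the weight w and the intercept b."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def predict_probability(x, w, b):
    """Predicted probability that a customer with feature value x buys."""
    return sigmoid(w * x + b)


# Made-up past data: visits per month and whether the customer bought.
visits = [0, 1, 2, 3, 4, 5, 6, 7]
bought = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(visits, bought)
for x in (0, 3, 7):
    print(x, round(predict_probability(x, w, b), 2))
```

With this toy data the fitted weight is positive, so the predicted probability of buying rises with the number of visits. The same structure, a weighted sum passed through a sigmoid, is also the building block of a neural network.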


I  would like to explain how classification works in the coming weeks.  Do not miss it!  See you next week!

This course is the best for beginners of data analysis. It is free, too!


Last week, I started an online course on data analysis. It is “The Analytics Edge” on edX, one of the biggest MOOC platforms in the world (www.edx.org). The course description says, “Through inspiring examples and stories, discover the power of data and use analytics to provide an edge to your career and your life.” I have now completed Units 1 and 2 of the nine in the course, and found that it is the best MOOC course for beginners in data analysis. Let me tell you why.


1. There are a variety of data sets to analyze

When you start learning data analysis, the data itself is very important in motivating you to continue. If you are in sales, sales data is the best material for learning data analysis because you are professionally interested in sales. If you are in the financial industry, financial data is the best for you. This course uses a variety of data, from crime rates to automobile sales, so you can find data you are interested in. This is critically important for beginners in data analysis.


2. This course focuses on how to use analytics tools, rather than the theory behind the analysis

Many data analysis courses take a long time to explain the theory behind the analysis. That is required if you want to be a data scientist, because theory is needed to construct an analytic method by yourself. However, most business managers do not want to be data scientists. All business managers need is a way to analyze data to make better business decisions. For this purpose, this course is good and well balanced between theory and practice. First a short summary of the theory is provided, and then the course moves on to practice. Most of the lectures focus on “how to use R for data analysis”. R is one of the best-known programming languages for data analysis, and it is free for everyone. The course enables beginners to learn to use R for analyzing data step by step.


3. It covers major analytic methods of data analysis.

When you look at the schedule of the course, you find many analytic methods, from linear regression to optimization. This course covers the major methods that beginners must know. If you do not have enough time to complete all the units, I recommend focusing on linear regression and logistic regression, because both methods are applicable to many cases in the real world.



I think it is worth watching the videos in Units 1 and 2 alone. They use interesting topics, especially for people who like baseball. If you do not have enough time to learn R programming, it is OK to skip it. The story behind the analysis is very good and informative for beginners, so the first time you may enjoy the story videos and skip the programming videos. If you want to obtain a certificate from edX, you need a score of at least 55% over the homework, competition and final exam. For beginners, it may be difficult to complete the whole course within the limited time (three months). Do not worry. I think this course will be offered again in the future. So the first time, please focus on Unit 1 and Unit 2, then the second time, try the whole course if you can. In addition, most edX courses, including this one, are free for anyone. You can enjoy them anytime, anywhere, as long as you have internet access. Would you like to try this course with me (www.toshistats.net)?

I started Microsoft Azure ML. It is definitely amazing!




Finally, I have started using MS (Microsoft) Azure ML (Machine Learning). So in this blog, I would like to report what it is and why it is amazing not only for data scientists but also for businessmen and businesswomen. MS Azure ML is an ML service on the cloud. It makes it easy to start data analysis with ML, even for beginners. For data analysis, it is critically important to have a seamless process: 1. Data, 2. Models, 3. Output. Unfortunately, most ML services are provided independently of other services, so users have to gather data and report the results of data analysis to stakeholders and management one by one, outside the ML service. However, when we look at the MS Azure portal, Machine Learning is built in as one of the functions of MS Azure, so we can operate ML as one of the processes within MS Azure. This is completely different from other ML services. Now let me go to MS Azure ML Studio and look at its major functions in detail.

After creating an ML workspace, we can go to ML Studio, where experiments can be run through a graphical user interface.

1.  Data

More than 30 data sets, for example census income data, are set up in advance, so beginners can start practicing data analysis immediately. This is good because they can concentrate on data analysis in MS Azure ML. The data to be analyzed just has to be dragged and dropped into the experiment area, so data can be handled intuitively. There is no need to read manuals in advance.


2. Predictive models

In ML Studio, there are more than 10 predictive models for classification. Logistic regression, neural networks, SVM and others are available. Models for regression and clustering are also available. According to the documentation, more than 300 open-source R packages are also available. It is amazing that these models can be used by drag and drop in ML Studio, so beginners can analyze data without coding the models.


3. Output

Once data analysis is completed and predictive models are developed, it is easy to release them as web application services by clicking the buttons to deploy them on the web. It is usually difficult to explain how predictive models work by theory alone. Web applications can be powerful tools for explaining how the models work to stakeholders, management and customers, because they show the results based on inputs from users.


As I said before, it is critically important to have a seamless process: 1. Data, 2. Model, 3. Output. Microsoft Azure ML realizes this as a cloud service. I would like to develop interesting web services based on machine learning in the future. The current version of MS Azure ML is a preview, so functionality might be changed, removed or added going forward. If you need more information about MS Azure ML, please refer to this website. Let us enjoy machine learning!

Can Machine Learning solve today’s problem? Maybe yes, because…



Machine learning is a hot topic this year. Machine learning is defined as the “field of study that gives computers the ability to learn without being explicitly programmed” (Arthur Samuel, 1959). As large amounts of data and computing resources have recently become available at lower cost, machine learning is getting popular in the field of data analysis.

In academia, there is no doubt that machine learning performs well in statistical computing. But how about the real world? When we try to apply machine learning to data analysis on a daily basis, there are two difficulties to cope with. One is preparing data sets for training, and the other is implementing the models on computers in order to obtain results from them. As the word “learning” suggests, a training data set is required so that computers can learn from the data before the models generate results from observed data. Implementing these processes requires knowledge and expertise in data analytics. It can take a week or a month, depending on the availability of data scientists.

So I thought it might be difficult to solve today’s problem within today, until I heard the announcement from Microsoft on 16 June 2014. It was about Microsoft Azure ML, a machine learning tool operated on Microsoft’s platform “Azure”. Azure is one of the platforms on the cloud, so it competes with Google Apps for Business and AWS. Although the details are not disclosed yet, it looks as if Azure ML is better than other analytics tools at establishing a seamless process from preparing the data set to implementing the model on computers, because Azure itself is a seamless platform and Azure ML is a part of it rather than an independent product. So users do not need to pay attention to the relationships among independent components in the platform. All they have to do is feed in data and obtain the results. It means that we can take a shortcut in analyzing data on a daily basis. This is critically important because data-driven management requires a quick response to changes in the business environment. Microsoft says that the preview of Azure ML will start in July. It should be exciting and should enhance data-driven management.

Once machine learning becomes a user-friendly tool, what should business managers do? I think it is very important to be aware of what data is available around us now and what will be available in the future. Data is the starting point for data analysis, and the amount of available data has increased exponentially. “The data we create and copy annually” is doubling in size every two years from 2013 to 2020, according to research conducted by IDC. Yes, we should be data-savvy managers, as machine learning stands by us!
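To see what “doubling every two years” adds up to, here is a small calculation. It simply takes the quoted doubling period literally and assumes steady growth over the whole period; it is my own arithmetic, not a figure from the IDC research.

```python
def growth_factor(start_year, end_year, doubling_period_years=2.0):
    """How many times larger the yearly data volume becomes between two
    years, assuming it doubles once every doubling period."""
    return 2 ** ((end_year - start_year) / doubling_period_years)


print(round(growth_factor(2013, 2020), 1))  # 11.3
```

Seven years contain three and a half doubling periods, so under this assumption the data created in 2020 would be about eleven times the data created in 2013.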

Is it a new star in MOOCs ?


Following last week’s blog, I would like to write about MOOCs again, because yesterday I found a potential new star in MOOCs. It is the “Nanodegree” provided by Udacity, one of the big names in MOOCs. Unlike edX and Coursera, two other big names in MOOCs, Udacity has close relationships with industry, including AT&T, Facebook and Google. Therefore, “Nanodegree” courses will provide project-based programs and prepare participants to be hired. According to Udacity’s blog, AT&T is offering 100 paid internships to top graduates of the nanodegree program and will consider students with nanodegrees when there is a potential job match. Nanodegree is a trademark of Udacity.


Compared with other MOOC courses that I have taken before, the nanodegree might have some advantages.

1.  It is focused on “being ready to be hired”

It is often said that current higher education cannot meet the demands of industry, so there is a gap between the knowledge and skills of college graduates and those required by employers. Although the details of the courses are not disclosed, they are expected to be based on activities and operations in industry, rather than academia. From the standpoint of employers, it is good that the Nanodegree is created for the needs of industry, so that it has far less mismatch with those needs.

2.  It is not free, however, still far cheaper than on-campus courses.

It will cost 200 USD per month, and it takes from 6 to 12 months to complete the courses. Compared with on-campus courses, it is much cheaper. I think one of the reasons it is not free is that coaching is available to participants during the courses. Coaching is considered necessary for participants to learn effectively and stay motivated throughout the courses.
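As a quick check of what this means in total, here is the arithmetic using only the monthly fee and the 6 to 12 month range quoted above. It assumes the fee stays the same every month and that there are no other charges, which is my assumption.

```python
MONTHLY_FEE_USD = 200           # monthly fee quoted above
MIN_MONTHS, MAX_MONTHS = 6, 12  # duration range quoted above


def total_cost(months, monthly_fee=MONTHLY_FEE_USD):
    """Total fee in USD for a given number of months of study."""
    return months * monthly_fee


print(total_cost(MIN_MONTHS), total_cost(MAX_MONTHS))  # 1200 2400
```

So under these assumptions a nanodegree would cost between 1,200 and 2,400 USD in total.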

3. It takes less than one year to finish the courses and obtain the certificates

According to Udacity, it takes from 6 to 12 months to complete the courses when participants spend 10 hours per week on them. So they may be completed in less than 6 months when participants spend more than 10 hours per week. This is very good, especially for participants who are already working. Although course sequences are also available on edX and Coursera (XSeries on edX, Specializations on Coursera), some of them cannot be completed within 6 months. The business environment changes so quickly that the shorter the course is, the better it is for participants.


According to Udacity’s blog, the nanodegree is expected to start in fall 2014. There will be five courses: front-end web developer, back-end web developer, iOS mobile developer, Android mobile developer and data analyst. I would like to take one of them and gain expertise in it. It should be exciting!


Do you want to be “Analytic savvy manager”?

Data is and will be all around us, and it is increasing at an astonishing rate. In such a business environment, what should business managers do? I do not think every manager should have analytics skills at the same level as a data scientist, because that is almost impossible. However, I do think every manager should communicate with data scientists and make better decisions by using the output of their data analysis. This kind of manager is sometimes called an “analytic savvy manager”. Let us consider what an “analytic savvy manager” should know.


1.   What kind of data is available to us?

Business managers should know what kind of data is available for their business analysis. Some data is free and some is not. Some is held inside companies or is private, and some is public. Some is structured and some is not. Note that the available data is increasing in both volume and variety. Data is the starting point of analysis; however, data scientists may not know specific fields of business in detail. It is business managers who know what data is available to their businesses. Recently, data gathering services have provided us with a lot of data for free. I recommend looking at “Quandl” to find public data. It is easy to use and provides a lot of public data for free. Strong recommendation!


2.  What kind of analysis method can be applied?

Business managers do not need to memorize the formulas of each analysis method. I recommend that business managers understand simple linear regression and logistic regression and get the big picture of how statistical models work. Once you are familiar with these two methods, you can understand other, more complex statistical models with ease, because the fundamental structures are not so different among methods. Statistical models enable us to understand what big data means without loss of information. In addition, I recommend that business managers keep in touch with the progress of machine learning, especially deep learning. This method performs very well and is expected to be used in many business fields, such as natural language processing. It may change the landscape of business going forward.


3.  How can output from analysis be used to make better decisions?

This is critically important for making better decisions. The output of data analysis should be aligned with the business needs behind the decisions. Data scientists can guarantee that the numbers in the output are accurate in terms of calculation. However, they cannot guarantee that the output is relevant and useful for making better decisions. Therefore, business managers should communicate with data scientists during the process of data analysis and make the output of the analysis relevant to business decisions. That is the goal of data analysis.


I do not think the points above are difficult for business managers to understand, even if they do not have a quantitative analytic background. If you become familiar with these points, it will set you apart from others in the age of big data.

Do you want to be “Analytic savvy manager”?