Challenge to Machine Learning


Machine learning is becoming a popular and attractive way to analyze big data. Its algorithms have a long history of development going back to the 1950s. However, machine learning has come into the spotlight among data scientists only recently, because the large amounts of data, computing resources and data storage that it requires have become available at reasonable cost. I would like to introduce a basic machine learning algorithm using the R language, which I have recommended before.

1.  Problem sets

The observed data are x = [1,2,3] and y = [5,7,9]. Assuming the relationship can be expressed as y = ax + b, I would like to find a and b. Yes, it is obvious that a = 2 and b = 3; however, I want an algorithm to calculate them.
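As a quick cross-check before writing the algorithm by hand, base R's lm() function can fit the same line by ordinary least squares. This is only a sanity check on the expected answer, not the algorithm discussed in this post; the variable names fit, coefs, a and b are chosen here for illustration.

```r
# Observed data from the problem above
x <- c(1, 2, 3)
y <- c(5, 7, 9)

# Ordinary least squares fit of y = a*x + b using base R
fit <- lm(y ~ x)
coefs <- coef(fit)

b <- unname(coefs["(Intercept)"])  # intercept
a <- unname(coefs["x"])            # slope
print(c(a = a, b = b))
```

Because the three points lie exactly on a line, the fitted slope and intercept should match a = 2 and b = 3 up to floating-point error.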

 

2. Algorithm

This is my machine learning program to find a and b. I would like to focus on the loop in the program, which repeats the following steps.

First step      :       update parameters

Second step :       calculate the updated value of the cost function by the updated parameters

Third step    :       compare the updated value with the old value of the cost function and stop the calculation if it is considered to have converged

Otherwise, go back to the first step above.

These three steps are used in many machine learning algorithms, so it is useful to remember them.
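For reference, the cost function and the update rule that the program below computes can be written out as follows. The symbols are my own notation and do not appear in the program: X is the 3×2 matrix whose first column is all 1s and whose second column holds the observed x values, θ is the parameter vector (b, a), α is the learning rate (the argument le), and m is the number of observations.

```latex
J(\theta) = \frac{1}{2m}\,(X\theta - y)^{\top}(X\theta - y),
\qquad
\theta_{\text{new}} = \theta - \frac{\alpha}{m}\,X^{\top}(X\theta - y)
```

The loop stops as soon as |J(θ_new) − J(θ)| ≤ 10^(-8); otherwise θ is replaced by θ_new and the steps repeat.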

 

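ML <- function(le) {
  # Design matrix: a column of 1s (for b) and the observed x values (for a)
  x <- matrix(c(1, 1, 1, 1, 2, 3), 3, 2)
  y <- matrix(c(5, 7, 9), 3, 1)
  theta <- matrix(1, 2, 1)   # initial guess: b = 1, a = 1
  m <- length(y)
  h <- x %*% theta
  j <- 1 / (2 * m) * (t(h - y) %*% (h - y))   # initial value of the cost function

  for (i in seq(1, 1000)) {
    h <- x %*% theta
    # First step: update parameters
    theta_new <- theta - le / m * t(x) %*% (h - y)
    # Second step: calculate the updated value of the cost function
    h_new <- x %*% theta_new
    j_new <- 1 / (2 * m) * (t(h_new - y) %*% (h_new - y))
    # Third step: stop if the cost function is considered to have converged
    if (abs(j_new - j) <= 10^(-8)) break
    theta <- theta_new
    j <- j_new
    print(i)
    print(theta)
  }
}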

 

3.  The result of calculation

I use le = 0.1 as the learning rate and get the result below.

[1] 521

[,1]
[1,] 2.997600
[2,] 2.001056

This means that the value of the cost function converges after 521 iterations, with a = 2.001056 and b = 2.997600. These are very close to the true values a = 2 and b = 3, so the algorithm can be considered to have found the solution.
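The program above prints every iteration and returns nothing, which makes the result hard to reuse. Below is a minimal sketch of the same loop written as a function that returns the estimates. The function name gradient_descent, its arguments tol and max_iter, and the returned list are my own choices for illustration, not part of the original program; the starting values, cost function and stopping rule are kept the same.

```r
# Hypothetical helper: the same update loop, returning the result
gradient_descent <- function(x_obs, y_obs, le = 0.1, tol = 1e-8, max_iter = 1000) {
  m <- length(y_obs)
  x <- matrix(c(rep(1, m), x_obs), ncol = 2)  # column of 1s, then observed x
  y <- matrix(y_obs, ncol = 1)
  theta <- matrix(1, 2, 1)                    # initial guess: b = 1, a = 1

  # Cost function: half the mean squared error
  cost <- function(th) sum((x %*% th - y)^2) / (2 * m)

  j <- cost(theta)
  iterations <- 0
  for (i in seq_len(max_iter)) {
    theta_new <- theta - le / m * t(x) %*% (x %*% theta - y)  # 1. update parameters
    j_new <- cost(theta_new)                                  # 2. updated cost
    if (abs(j_new - j) <= tol) break                          # 3. convergence check
    theta <- theta_new
    j <- j_new
    iterations <- i
  }
  list(b = theta[1, 1], a = theta[2, 1], iterations = iterations, cost = j)
}

result <- gradient_descent(c(1, 2, 3), c(5, 7, 9), le = 0.1)
print(result)
```

With the same learning rate and stopping rule, the returned a and b should be close to the values reported above. Trying other values of le is a simple way to see how the learning rate affects the number of iterations.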

 

This algorithm is one of the simplest. However, it has the fundamental structure that can be applied to other, more complex algorithms, so I recommend that you implement it yourself and become familiar with this kind of algorithm. In short: 1. Update the parameters. 2. Calculate the updated value of the cost function. 3. Check whether the updated value has converged. Yes, it is that simple!

TOSHI STATS SDN. BHD. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using algorithms, instructions, methods or ideas contained herein, or acting or refraining from acting as a result of such use. TOSHI STATS SDN. BHD. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on TOSHI STATS SDN. BHD. and me to correct any errors or defects in the codes and the software.

Let us go surfing the sea of big data!

In the morning, I check my smartphone and iPad mini to see what happened during the night. Every time I touch them, data is generated automatically. How many devices such as smartphones and tablets are there in the world? I am sure a lot of data is being generated at this very moment.

It should also be noted that the FRB, the World Bank, the IMF and other public institutions make their data available to the public through their web sites. In addition, a lot of public data is becoming easier to access thanks to data gathering services. Data is the first thing to consider when we start data analytics. Therefore, it is very important to know what kind of data is at your disposal for analysis.

I have been using a data gathering service called “Quandl”. Quandl is a “data platform”, which enables us to collect numerical data published by hundreds of different sources and hosts them on a single easy-to-use website. Currently it can be used for free. Once I obtain the data, I visualize it in order to understand what it means and what mechanism lies behind it. I use “DataHero” to visualize the data I obtain. It makes it easy to produce many kinds of charts and graphs. With DataHero, I can produce a lot of graphs by following its instructions and then choose the best one to present what I want to say. Its basic functionality can be used without any fee. If you pay, you get more functionality, such as a tool to combine multiple datasets.

 

According to The Sun newspaper on May 20, 2014, Mr Najib Abdul Razak, prime minister of Malaysia, described the 6.2% GDP growth in the first quarter of this year as extremely outstanding. This is the highest figure in the list he presented on his Facebook page. Let us see how the economic growth of Malaysia has developed from the past to the present. I pick up the data for the real GDP growth rate, the unemployment rate and the consumer price index (CPI) in Malaysia since 1990 using Quandl and visualize them using DataHero. It is very easy and takes less than 5 minutes once you are familiar with these systems.

Source: Open Data for Africa (IMF)

DataHero Malaysia economic growth

This graph shows economic growth in Malaysia since 1990. The growth rate has been over 5%, except during the two economic crises of 1998 and 2009, and in 2001. The unemployment rate has been around 3%, which is good for the economy. CPI has also been stable at around 3%. I can say that economic growth without inflation is currently being achieved in Malaysia.

I would like to compare this with the situation in Japan since 1980. Let us see the graph below.

DataHero Japan economic growth

GDP growth peaked in the late 1980s, at the height of the bubble economy. Since 1990, when the bubble burst, Japan has experienced low economic growth. CPI has been very low and has sometimes turned negative, as Japan has been in deflation. The unemployment rate has gradually increased and peaked at over 5%. This period is called “the lost two decades” because of Japan's poor economic performance. It is not easy to explain why this happened in Japan. Some economists argue that monetary policy was not effective enough to revive the economy. Others criticize the fiscal stimulus as too late, too small and too short. I would like to analyze the mechanism of the “lost two decades” going forward in this blog.

 

 

 

Do you want to be an “analytic savvy manager”?

Data is, and will remain, all around us, and it is increasing at an astonishing rate. In such a business environment, what should business managers do? I do not think every manager should have analytics skills at the same level as a data scientist, because that is almost impossible. However, I do think every manager should be able to communicate with data scientists and make better decisions by using the output of their data analysis. This kind of manager is sometimes called an “analytic savvy manager”. Let us consider what an “analytic savvy manager” should know.

 

1.   What kind of data is available to us?

Business managers should know what kind of data is available for their business analysis. Some data is free and some is not. Some is held privately within companies and some is public. Some is structured and some is not. It should be noted that the available data is increasing in both volume and variety. Data is the starting point of analysis; however, data scientists may not know a specific field of business in detail. It is business managers who know what data is available to the business. Recently, data gathering services have begun to provide a lot of data for free. I recommend that you look at “Quandl” to find public data. It is easy to use and provides a lot of public data for free. Strongly recommended!

 

2.  What kind of analysis method can be applied?

Business managers do not need to memorize the formulas of each analysis method. I recommend that business managers understand simple linear regression and logistic regression and get the big picture of how statistical models work. Once you are familiar with these two methods, you can understand other, more complex statistical models with ease, because the fundamental structures do not differ much among methods. Statistical models enable us to understand what big data means without loss of information. In addition, I also recommend that business managers keep in touch with the progress of machine learning, especially deep learning. This method performs very well and is expected to be used in many business fields, such as natural language processing. It may change the landscape of business going forward.
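To give a feel for how little code these two methods need, here is a minimal sketch in R. It uses mtcars, a small car dataset that ships with R, purely as a stand-in because this post does not come with a business dataset; the choice of variables is my own assumption for illustration.

```r
# Simple linear regression: fuel economy (mpg) explained by weight (wt)
linear_fit <- lm(mpg ~ wt, data = mtcars)
print(coef(linear_fit))

# Logistic regression: probability of a manual transmission (am = 1)
# explained by weight (wt)
logistic_fit <- glm(am ~ wt, data = mtcars, family = binomial)
print(coef(logistic_fit))

# Predicted probability of a manual transmission for a hypothetical car
# with wt = 3 (wt is measured in units of 1,000 lbs)
p_manual <- predict(logistic_fit, newdata = data.frame(wt = 3), type = "response")
print(p_manual)
```

Both models share the same structure: an outcome, one or more explanatory variables, and estimated coefficients. The main difference is that linear regression predicts a number, while logistic regression predicts a probability between 0 and 1.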

 

3.  How can output from analysis be used to make better decisions?

This is critically important for making better decisions. The output of data analysis should be aligned with the business needs behind a decision. Data scientists can guarantee that the numbers in the output are accurate in terms of calculation. However, they cannot guarantee that the output is relevant and useful for making better decisions. Therefore, business managers should communicate with data scientists during the process of data analysis and make the output of the analysis relevant to business decisions. That is the goal of data analysis.

 

I do not think these points are difficult for business managers to understand, even if they do not have a quantitative analytic background. If you become familiar with them, it will set you apart from others in the age of big data.

Do you want to be an “analytic savvy manager”?


 

What an excellent tool “R” is!

The R language is an incredible statistical tool and is being improved continuously. Ten years ago, when I tried to practice data analysis privately at home, I used Excel because my PC came with Microsoft Office pre-installed. On the other hand, I used proprietary tools such as MATLAB in the companies where I worked. MATLAB was an excellent tool for analyzing data, but the problem was the cost of keeping it. I could not pay this cost by myself as it was too expensive for me, so I was forced to use Excel for my personal data analysis at home. There was no other choice. Many times I wished I had MATLAB on my PC. Many experts in the financial industry have written books about MATLAB programming. However, I could not program in it at home, as I had no MATLAB environment there. So I was very surprised when I saw how R worked three years ago. It can be downloaded without any fee and has powerful functions built in. I can program freely and store my programs as my own functions. I decided to start learning R. Now I know how excellent R is, and I always recommend it to anyone who is interested in statistics and data analytics.

 

R has advantages compared to other tools

1.  R is available without any fee.

This is the biggest advantage over proprietary tools, especially for beginners in data analytics. With R, beginners have the opportunity to gain experience in data analytics with a tool used by professionals. R lowers the barrier to entering the world of data analytics. Many people start data analytics out of curiosity. In such cases, it is very difficult to invest a lot of money in statistical tools. Now there is no need to worry: just go to the R-project site and download R. It is easy and available to everyone who has access to the internet.

 

2.  R is open source

R is open source; therefore, it is transparent and you can write your programs as you want. When you write excellent programs, you can make them available to anyone all over the world through the R-project site. If you go to this site, you can find many kinds of programs, covering fields from economics and finance to biostatistics. These programs are called “packages”. They are prepared by professionals all over the world, and you can look at the code of each one if you want. According to the R-project site, there are more than five thousand packages and the number is still increasing. No one knows the real total number of programs written in R. Fortunately, most programs are available to anyone without any cost. This is wonderful for anyone who is interested in data analytics.
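As a small illustration of this transparency, the sketch below prints the source code of sd(), a standard function that ships with R, and lists some of the packages already installed on your machine. The package name in the commented-out lines is only an example of mine, and those lines are left commented out because installing a package needs internet access.

```r
# Typing or printing a function's name without parentheses shows its source code
print(sd)

# Names of the packages already installed in this R library
installed <- rownames(installed.packages())
print(head(installed))

# Installing and loading a contributed package (example name only):
# install.packages("ggplot2")
# library(ggplot2)
```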

 

3.  There is a lot of information about R on the internet.

R is a good tool for learning statistics because there are a lot of tutorials, instructions and documents on the internet. Most of them are free, so you do not have to buy books about R. This is one of the reasons why I set up my start-up for digital learning of statistical computing. I have prepared an introductory course on R. If you are interested in R, you can take a look at it, of course without any fee.

 

I drew this chart with R. It is fun to do. Let’s start using R now!