3 Easy Steps to Get Started With Machine Learning: Understand the Concept and Implement Linear Regression in Python

What will we cover in this article?

  • What is Machine Learning and how it can help you?
  • How does Machine Learning work?
  • A first example of Linear Regression in Python

Step 1: How can Machine Learning help you?

Machine Learning is a hot topic these days and it is easy to get confused when people talk about it. But what is Machine Learning and how can it help you?

I found the following explanation quite good.

Classical vs modern (no Machine Learning vs Machine Learning) approach to predictions.

In the classical computing model, everything is programmed into the algorithm. This has the limitation that all the decision logic needs to be understood before usage, and if things change, we need to modify the program.

With the modern computing model (Machine Learning) this paradigm changes. We feed the algorithm with data, and based on that data, the program makes the decisions.

While this can seem abstract, it is a big change in how we think about programming; the small sketch after the list illustrates the difference. Machine Learning has helped computers solve problems like:

  • Improved search engine results.
  • Voice recognition.
  • Number plate recognition.
  • Categorisation of pictures.
  • …and the list goes on.
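To make the contrast concrete, here is a minimal sketch (with made-up data and rules): in the classical approach the pricing rule is written by hand, while in the Machine Learning approach the same kind of rule is learned from examples.

# Classical approach: the decision logic is hand-coded.
def classic_price(size):
    return 5*size + 25  # the rule must be known up front

# Machine Learning approach: the rule is learned from example data.
import numpy as np
from sklearn import linear_model

sizes = np.array([30, 50, 60]).reshape((-1, 1))
prices = [175, 275, 325]

model = linear_model.LinearRegression()
model.fit(sizes, prices)      # the "rule" is derived from the data
print(model.predict([[40]]))  # predict a price for a size never seen before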

Step 2: How does Machine Learning work?

I’m glad you asked. I was wondering about that myself.

On a high level you can divide Machine Learning into two phases.

  • Phase 1: Learning
  • Phase 2: Prediction

The Learning phase is divided into steps.

Machine Learning: The Learning Phase: Training data, Pre-processing, Learning, Testing.

It all starts with a training set (training data). This data set should represent the type of data that the Machine Learning model will be used to predict from in Phase 2 (prediction).

The pre-processing step is about cleaning up the data. While Machine Learning is powerful, it cannot figure out on its own what good data looks like. You need to do the cleaning, as well as transform the data into the desired format.
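As a small, hypothetical example of pre-processing, the sketch below drops incomplete records and converts strings into numbers before the data reaches the learning step.

# Hypothetical raw data: some records are incomplete or still strings.
raw = [("50", "245"), ("60", "312"), (None, "199"), ("35", "")]

# Cleaning: drop incomplete records and transform strings into numbers.
cleaned = [(float(size), float(price))
           for size, price in raw
           if size is not None and price != ""]

print(cleaned)  # [(50.0, 245.0), (60.0, 312.0)]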

Then comes the magic: the learning step. There are three main paradigms in Machine Learning.

  • Supervised: you tell the algorithm which category each data item belongs to; each data item in the training set is tagged with the right answer.
  • Unsupervised: the algorithm is not told the categories and has to find the structure in the data itself.
  • Reinforcement: teaches the machine to think for itself based on rewards for past actions.

Finally, testing is done to see if the model is good. The training data was divided into a training set and a test set: the model is built from the training set, and the test set is used to check whether the model predicts well on data it has not seen. If not, a new model might be necessary.
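A common way to make that split in Python is sklearn's train_test_split; here is a minimal sketch (the 25% test fraction is just an assumption).

from sklearn.model_selection import train_test_split

size = [50, 60, 35, 55, 30, 65, 30, 75, 25]
prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]

# Hold out 25% of the data to test the trained model on unseen items.
x_train, x_test, y_train, y_test = train_test_split(
    size, prices, test_size=0.25, random_state=0)

print(len(x_train), "items for training,", len(x_test), "for testing")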

After that the Prediction Phase begins.

How Machine Learning predicts new data.

Once the model has been created, it is used to make predictions from new data.

Step 3: Our first example of Linear Regression in Python

Installing the libraries

Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables. In the case we model here, we do it for one single variable. Said in another way, we want to map the points on a graph to a line (y = a*x + b).

To do that, we need to import various libraries.

# Importing matplotlib to make a plot
import matplotlib.pyplot as plt
# work with number as array
import numpy as np
# we want to use linear_model (that uses datasets)
from sklearn import linear_model

The matplotlib library is used here to make a plot, but it is a comprehensive library for creating static, animated, and interactive visualizations in Python. If you do not have it installed, you can do that by typing the following command in a terminal.

pip install matplotlib

The numpy library is powerful for calculating with N-dimensional arrays. If needed, you can install it by typing the following command in a terminal.

pip install numpy

Finally, you need the linear_model from the sklearn library, which you can install by typing the following command in a terminal.

pip install scikit-learn

Training data set

This simple example will let you make a linear regression on the following data set.

# data set
prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]
size = [50, 60, 35, 55, 30, 65, 30, 75, 25]

Here some items have been sold, and each item has a size. The first item was sold for 245 ($) and had a size of 50 (something). The next item was sold for 312 ($) and had a size of 60 (something).

The sizes need to be reshaped before we can model the data.

# Importing matplotlib and numpy and sklearn
import matplotlib.pyplot as plt
# work with number as array
import numpy as np
# we want to use linear_model (that uses datasets)
from sklearn import linear_model

# data set
prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]
size = [50, 60, 35, 55, 30, 65, 30, 75, 25]

# reshape the input for regression (-1: infer the number of rows, 1: one column)
size2 = np.array(size).reshape((-1, 1))
print(size2)

Which results in the following output.

[[50]
 [60]
 [35]
 [55]
 [30]
 [65]
 [30]
 [75]
 [25]]

Hence, the reshape((-1, 1)) transforms the flat list into a column vector: a 2-dimensional array with one row per value and a single column.

Then for the linear regression.

# Importing matplotlib and numpy and sklearn
import matplotlib.pyplot as plt
# work with number as array
import numpy as np
# we want to use linear_model (that uses datasets)
from sklearn import linear_model

# data set
prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]
size = [50, 60, 35, 55, 30, 65, 30, 75, 25]

# reshape the input for regression (-1: infer the number of rows, 1: one column)
size2 = np.array(size).reshape((-1, 1))
print(size2)

regr = linear_model.LinearRegression()
regr.fit(size2, prices)
print("Coefficients", regr.coef_)
print("intercepts", regr.intercept_)

Which prints out the coefficient (a) and the intercept (b) of a formula y = a*x + b.

Now you can predict future prices, when given a size.

# How to predict
size_new = 60
price = size_new * regr.coef_ + regr.intercept_
print(price)
print(regr.predict([[size_new]]))

You can either compute the price directly from the coefficient and intercept (the price = … line) or use the regression model's predict method (the last line).

Finally, you can plot the linear regression as a graph.

# Importing matplotlib and numpy and sklearn
import matplotlib.pyplot as plt
# work with number as array
import numpy as np
# we want to use linear_model (that uses datasets)
from sklearn import linear_model

# data set
prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]
size = [50, 60, 35, 55, 30, 65, 30, 75, 25]

# reshape the input for regression (-1: infer the number of rows, 1: one column)
size2 = np.array(size).reshape((-1, 1))
print(size2)

regr = linear_model.LinearRegression()
regr.fit(size2, prices)

# Here we plot the graph
x = np.array(range(20, 100))
y = regr.coef_*x + regr.intercept_
plt.plot(x, y)
plt.scatter(size, prices, color='black')
plt.ylabel('prices')
plt.xlabel('size')
plt.show()

Which results in the following graph.

Example of linear regression in Python.

Conclusion

This is obviously a simple example of linear regression, as it only has one variable. Still, it shows you how to set up the environment in Python and how to make a simple plot.

Understand the Security of One Time Pad and how to Implement it in Python

What is a One Time Pad?

A One Time Pad is an information-theoretically secure encryption function. But what does that mean?

On a high level, the One Time Pad is a simple XOR function that takes the input and XORs it with a key-stream.

One Time Pad illustrated.

Encryption and decryption are identical, as the one-byte sanity check after the list shows.

  • Encryption: takes the plaintext and the key-stream and XORs them to get the cipher text.
  • Decryption: takes the cipher text and the (same) key-stream and XORs them to get the plaintext.
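Since XOR is its own inverse, applying the same key-stream twice returns the plaintext. A one-byte sanity check:

m = 0x41           # a plaintext byte ('A')
k = 0x5C           # a key-stream byte
c = m ^ k          # encryption
assert c ^ k == m  # decryption with the same key recovers the plaintext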

Some requirements of the One Time Pad are:

  • The key-stream should only be used once.
  • The key-stream should only be known by the sender and the receiver.
  • The key-stream should be generated from true randomness.

Hence, the requirements are only on the key-stream, which is the only secret the sender and receiver need to share.

The beauty of the algorithm is the simplicity.

Understand the Security of the One Time Pad

The One Time Pad is information-theoretically secure.

That means that even if an evil adversary had infinite computing power, they could not break it.

One Time Pad is unbreakable – it is information-theoretically secure.

The simplest way to understand why that is the case is the following: if an adversary catches an encrypted message of, say, 10 characters, it can decrypt to any message of length 10.

The reason is that the key-stream can be anything and is as long as the message itself. That implies that the plaintext can be any possible message of 10 characters.

If the key-stream is unknown, then the cipher text can decrypt to any message.

Implementation in Python

Obviously, we have a dilemma: we cannot generate a truly random key like that in Python.

The actual implementation of the One Time Pad is done by a simple xor.

def xor_bytes(key_stream, message):
    return bytes([key_stream[i] ^ message[i] for i in range(len(message))])

Of course, this requires that the key_stream and message have the same length.

It also leaves out the problem of where the key_stream comes from. The problem is that you cannot create a key_stream with the required properties in Python.
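In practice, a cryptographically secure pseudo-random generator is often used as a stand-in, even though the result is then no longer a true One Time Pad. A minimal sketch using Python's built-in secrets module:

import secrets

def xor_bytes(key_stream, message):
    return bytes([key_stream[i] ^ message[i] for i in range(len(message))])

message = "DO ATTACK".encode()
key_stream = secrets.token_bytes(len(message))  # CSPRNG, not true randomness

cipher = xor_bytes(key_stream, message)
assert xor_bytes(key_stream, cipher) == message  # the round trip recovers the plaintext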

Demonstrate the security in Python

If you were to receive a message encrypted by a One Time Pad, then for any guess of the plaintext, there is a matching key-stream to get it.

See the code below for a better understanding.

def xor_bytes(key_stream, message):
    return bytes([key_stream[i] ^ message[i] for i in range(len(message))])


# cipher is the intercepted cipher text
# len(cipher) == 9

# If we guess that the plaintext is "DO ATTACK"
# then the corresponding key_stream can be computed as follows
message = "DO ATTACK"
message = message.encode()
key_stream = xor_bytes(message, cipher)


# Similarly, if we guess the plaintext is "NO ATTACK"
# then the corresponding key_stream can be computed as follows
message = "NO ATTACK"
message = message.encode()
guess_key = xor_bytes(message, cipher)

Conclusion

While the One Time Pad is an ideal encryption system, it is not practical. The reason is that there is no efficient way to generate and distribute a truly random key-stream that is only used once and not known by others than the sender and receiver.

Stream ciphers are often used as a compromise; an example of a stream cipher is A5/1.

A Simple 7 Step Guide to Implement a Prediction Model to Filter Tweets Based on Dataset Interactively Read from Twitter

What will we learn in this tutorial

  • How Machine Learning works and predicts.
  • What you need to install to implement your Prediction Model in Python
  • A simple way to implement a Prediction Model in Python with persistence
  • How to simplify the connection to the Twitter API using tweepy
  • Collect the training dataset from twitter interactively in a Python program
  • Use the persistent model to predict the tweets you like

Step 1: Quick introduction to Machine Learning

Machine Learning: Input to the Learner is Features X (data set) with Targets Y. The Learner outputs a Model, which can predict (Y) future inputs (X).

  • The Learner (or Machine Learning Algorithm) is the program that creates a machine learning model from the input data.
  • The Features X is the dataset used by the Learner to generate the Model.
  • The Target Y contains the categories for each data item in the Feature X dataset.
  • The Model takes new inputs X (similar to those in Features) and predicts a target Y, from the categories in Target Y.

We will implement a simple model that can classify tweets into two categories: allow and reject.
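In code terms, the Learner/Features/Targets flow looks like this tiny, hypothetical sketch (the example texts and names are made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

features_x = ["good news", "bad spam", "nice article", "cheap pills"]  # Features X
targets_y = [0, 1, 0, 1]  # Targets Y (0 = allow, 1 = reject)

vectorizer = CountVectorizer()
model = MultinomialNB()  # the Learner
model.fit(vectorizer.fit_transform(features_x), targets_y)  # produces the Model

print(model.predict(vectorizer.transform(["spam pills"])))  # predicts a Target Y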

Step 2: Install sklearn library (skip if you already have it)

The Python code will be using the sklearn library.

To install it, simply write the following in the command line (also see here).

pip install scikit-learn

Alternatively, you might want to install it locally in your user space.

pip install scikit-learn --user

Step 3: Create a simple Prediction Model in Python to Train and Predict on tweets

The implementation encapsulates the machine learning model in a class. The class has the following methods; the full code follows the list.

  • create_dataset: Creates a dataset from a list of data items that represent allow and a list of data items that represent reject. The dataset is divided into features and targets.
  • train_dataset: Once the dataset is loaded, it is trained to create the model, consisting of the predictor (transfer and estimator).
  • predict: Is called after the model is trained. It predicts whether an input is in the allow category.
  • persist: Saves the model for later use, so we do not need to collect data and train again. It should only be called after the dataset has been created and the model has been trained (i.e. after create_dataset and train_dataset).
  • load: Loads a saved model, making it ready to predict new input.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import joblib


class PredictionModel:
    def __init__(self):
        self.predictor = {}
        self.dataset = {'features': [], 'targets': []}
        self.allow_id = 0
        self.reject_id = 1

    def create_dataset(self, allow_data, reject_data):
        features = allow_data + reject_data
        targets = [self.allow_id]*len(allow_data) + [self.reject_id]*len(reject_data)
        self.dataset = {'features': features, 'targets': targets}

    def train_dataset(self):
        x_train, x_test, y_train, y_test = train_test_split(self.dataset['features'], self.dataset['targets'])

        transfer = TfidfVectorizer()
        x_train = transfer.fit_transform(x_train)
        x_test = transfer.transform(x_test)

        estimator = MultinomialNB()
        estimator.fit(x_train, y_train)

        # accuracy on the held-out test set (not used further here)
        score = estimator.score(x_test, y_test)
        self.predictor = {'transfer': transfer, 'estimator': estimator}

    def predict(self, text):
        sentence_x = self.predictor['transfer'].transform([text])
        y_predict = self.predictor['estimator'].predict(sentence_x)
        return y_predict[0] == self.allow_id

    def persist(self, output_name):
        joblib.dump(self.predictor['transfer'], output_name+".transfer")
        joblib.dump(self.predictor['estimator'], output_name+".estimator")

    def load(self, input_name):
        self.predictor['transfer'] = joblib.load(input_name+'.transfer')
        self.predictor['estimator'] = joblib.load(input_name+'.estimator')
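As a quick usage sketch (the example texts are made up), training the class on a handful of strings and predicting a new one could look like this.

model = PredictionModel()
model.create_dataset(
    allow_data=["breaking news about science", "new python release announced",
                "interesting article on machine learning"],
    reject_data=["buy followers now", "click here to win a prize",
                 "limited offer, act fast"])
model.train_dataset()

print(model.predict("python machine learning news"))  # True if classified as allow
model.persist("example_model")  # writes example_model.transfer and example_model.estimator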

Step 4: Get a Twitter API access

Go to https://developer.twitter.com/en and get your consumer_key, consumer_secret, access_token, and access_token_secret.

api_key = {
    'consumer_key': "",
    'consumer_secret': "",
    'access_token': "",
    'access_token_secret': ""
}

Also see here for a deeper tutorial on how to get them if in doubt.

Step 5: Simplify your Twitter connection

If you do not already have the tweepy library, then install it by typing the following in a terminal.

pip install tweepy

As you will only read tweets from users, the following class will help you to simplify your code.

import tweepy


class TwitterConnection:
    def __init__(self, api_key):
        # authentication of consumer key and secret
        auth = tweepy.OAuthHandler(api_key['consumer_key'], api_key['consumer_secret'])

        # authentication of access token and secret
        auth.set_access_token(api_key['access_token'], api_key['access_token_secret'])
        self.api = tweepy.API(auth)

    def get_tweets(self, user_name, number=0):
        if number > 0:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items(number)
        else:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items()

  • __init__: Sets up the Twitter API connection in the init-function.
  • get_tweets: Returns the tweets from a user_name (screen_name). A small usage sketch follows the list.
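A small usage sketch (assuming api_key is filled out as in Step 4) that prints a user's five most recent tweets:

twitter_con = TwitterConnection(api_key)
for tweet in twitter_con.get_tweets("@cnnbrk", number=5):
    print(tweet.full_text)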

Step 6: Collect the dataset (Features X and Target Y) from Twitter

To simplify your life, you will use the TwitterConnection class and the PredictionModel class from above.

def get_features(auth, user_name, output_name):
    positives = []
    negatives = []
    twitter_con = TwitterConnection(auth)
    tweets = twitter_con.get_tweets(user_name)
    for tweet in tweets:
        print(tweet.full_text)
        print("a/r/e (allow/reject/end)? ", end='')
        response = input()
        if response.lower() == 'y':
            positives.append(tweet.full_text)
        elif response.lower() == 'e':
            break
        else:
            negatives.append(tweet.full_text)
    model = PredictionModel()
    model.create_dataset(positives, negatives)
    model.train_dataset()
    model.persist(output_name)

The function reads the tweets from user_name and prompts for each one of them whether it should be added to tweets you allow or reject.

When you do not feel like “training” your set any more (i.e. collecting more training data), you can press e.

Then it creates the dataset, trains the model, and finally persists it.

Step 7: See how well it predicts your tweets based on your model

The following code will print the first number tweets from user_name that your model allows.

def fetch_tweets_prediction(auth, user_name, input_name, number):
    model = PredictionModel()
    model.load(input_name)
    twitter_con = TwitterConnection(auth)
    tweets = twitter_con.get_tweets(user_name)
    for tweet in tweets:
        if model.predict(tweet.full_text):
            print(tweet.full_text)
            number -= 1
        if number <= 0:
            break

Then your final piece is to call it. Remember to fill out your values for the api_key.

api_key = {
    'consumer_key': "",
    'consumer_secret': "",
    'access_token': "",
    'access_token_secret': ""
}

get_features(api_key, "@cnnbrk", "cnnbrk")
fetch_tweets_prediction(api_key, "@cnnbrk", "cnnbrk", 10)

Conclusion

I trained my model on 30-40 tweets with the above code. On that training set it did not have any false positives (that is, an allow which was a reject in the dataset), but it did have false rejects.

The full code is here.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import joblib
import tweepy


class PredictionModel:
    def __init__(self):
        self.predictor = {}
        self.dataset = {'features': [], 'targets': []}
        self.allow_id = 0
        self.reject_id = 1

    def create_dataset(self, allow_data, reject_data):
        features = allow_data + reject_data
        targets = [self.allow_id]*len(allow_data) + [self.reject_id]*len(reject_data)
        self.dataset = {'features': features, 'targets': targets}

    def train_dataset(self):
        x_train, x_test, y_train, y_test = train_test_split(self.dataset['features'], self.dataset['targets'])

        transfer = TfidfVectorizer()
        x_train = transfer.fit_transform(x_train)
        x_test = transfer.transform(x_test)

        estimator = MultinomialNB()
        estimator.fit(x_train, y_train)

        # accuracy on the held-out test set (not used further here)
        score = estimator.score(x_test, y_test)
        self.predictor = {'transfer': transfer, 'estimator': estimator}

    def predict(self, text):
        sentence_x = self.predictor['transfer'].transform([text])
        y_predict = self.predictor['estimator'].predict(sentence_x)
        return y_predict[0] == self.allow_id

    def persist(self, output_name):
        joblib.dump(self.predictor['transfer'], output_name+".transfer")
        joblib.dump(self.predictor['estimator'], output_name+".estimator")

    def load(self, input_name):
        self.predictor['transfer'] = joblib.load(input_name+'.transfer')
        self.predictor['estimator'] = joblib.load(input_name+'.estimator')


class TwitterConnection:
    def __init__(self, api_key):
        # authentication of consumer key and secret
        auth = tweepy.OAuthHandler(api_key['consumer_key'], api_key['consumer_secret'])

        # authentication of access token and secret
        auth.set_access_token(api_key['access_token'], api_key['access_token_secret'])
        self.api = tweepy.API(auth)

    def get_tweets(self, user_name, number=0):
        if number > 0:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items(number)
        else:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items()


def get_features(auth, user_name, output_name):
    positives = []
    negatives = []
    twitter_con = TwitterConnection(auth)
    tweets = twitter_con.get_tweets(user_name)
    for tweet in tweets:
        print(tweet.full_text)
        print("y/n/e (positive/negative/end)? ", end='')
        response = input()
        if response.lower() == 'y':
            positives.append(tweet.full_text)
        elif response.lower() == 'e':
            break
        else:
            negatives.append(tweet.full_text)
    model = PredictionModel()
    model.create_dataset(positives, negatives)
    model.train_dataset()
    model.persist(output_name)


def fetch_tweets_prediction(auth, user_name, input_name, number):
    model = PredictionModel()
    model.load(input_name)
    twitter_con = TwitterConnection(auth)
    tweets = twitter_con.get_tweets(user_name)
    for tweet in tweets:
        if model.predict(tweet.full_text):
            print("POS", tweet.full_text)
            number -= 1
        else:
            pass
            # print("NEG", tweet.full_text)
        if number <= 0:
            break

api_key = {
    'consumer_key': "_",
    'consumer_secret': "_",
    'access_token': "_-_",
    'access_token_secret': "_"
}

get_features(api_key, "@cnnbrk", "cnnbrk")
fetch_tweets_prediction(api_key, "@cnnbrk", "cnnbrk", 10)