In this project, we will build a disaster response web application that classifies messages into different categories, such as medical supplies, food, or blocked roads, and directs them to the right organization to speed up recovery!

In 2019, there were a total of 409 natural disasters worldwide. The irony is that we are right now in the middle of a global pandemic due to COVID-19. During or immediately after a disaster, millions of people reach out, either directly or via social media, for help from government or disaster relief and recovery services. Whether an affected person tweets or messages a helpline, chances are that the message will be lost among the thousands received. Often this is because a lot of people are tweeting while very few actually need help, and organizations do not have enough time to filter that many messages manually.


Data Overview

We will be analyzing real messages that were sent during disaster events. The data was collected by Figure Eight and provided by Udacity, a big thank you to them. Let's look at the data description:

  • messages.csv: Contains the id, the message that was sent, and the genre, i.e. the method (direct, tweet..) by which the message was sent.
  • categories.csv: Contains the id and the categories (related, offer, medical assistance..) that the message belonged to.

ETL Pipeline

So, in this part, we will merge the two datasets, messages.csv and categories.csv, on the common id column. The categories column in categories.csv is a single string, so we need to create a separate column for each category. Then we will remove duplicates and load the transformed data into a database using the SQLAlchemy library.

The categories column arrives as one delimited string of category-value pairs; after transformation, each category gets its own 0/1 column, as the sketch below shows.
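A minimal sketch of that step, assuming (as in the Figure Eight data) that the pairs are semicolon-separated name-0/1 strings:

import pandas as pd

# Merge the two CSVs on their common id column
messages = pd.read_csv('messages.csv')
categories = pd.read_csv('categories.csv')
df = messages.merge(categories, on='id')

# Split the single categories string into one column per category
categories = df['categories'].str.split(';', expand=True)
categories.columns = [c.split('-')[0] for c in categories.iloc[0]]

# Keep only the trailing 0/1 value and make it numeric
for col in categories.columns:
    categories[col] = categories[col].str[-1].astype(int)

# Replace the raw string column with the expanded columns and drop duplicates
df = pd.concat([df.drop(columns='categories'), categories], axis=1)
df = df.drop_duplicates()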

Load data into database

And we finally load the transformed data into the database, disaster.db. Check out the code for the entire ETL pipeline here.

from sqlalchemy import create_engine

engine = create_engine('sqlite:///disaster.db')
df.to_sql('disaster_response', engine, index=False)

ML Pipeline

Here, we will first load the dataset from the disaster.db database. Our main task is to convert the messages into tokens that a model can interpret, so we create a function that removes punctuation, tokenizes the words, removes stop words, and performs lemmatization.
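The loading step might look like this (a sketch: the table name comes from the ETL step above, the target columns follow from the data description, and the train/test split is assumed here, since held-out scores are quoted later):

import pandas as pd
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split

engine = create_engine('sqlite:///disaster.db')
df = pd.read_sql_table('disaster_response', engine)

# The message text is the input; every category column is a target
X = df['message']
Y = df.drop(columns=['id', 'message', 'genre'])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

With the data loaded, the tokenizer below handles the text cleaning.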

import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def tokenize(text):
    # Replace any URLs with a placeholder token
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, 'urlplaceholder')

    # Normalize case, strip punctuation, and tokenize
    tokens = nltk.word_tokenize(re.sub(r"[^a-zA-Z0-9]", " ", text.lower()))

    # Remove English stopwords
    tokens = [t for t in tokens if t not in stopwords.words('english')]

    # Reduce each token to its lemma (e.g. "supplies" -> "supply")
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]

    return tokens

So, this is what our function does.
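For instance, on a made-up message (the exact output depends on NLTK's data files):

tokenize("We need water and medical supplies in Port-au-Prince! http://example.com")
# ['need', 'water', 'medical', 'supply', 'port', 'au', 'prince', 'urlplaceholder']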

These tokens make sense to us, but they cannot be understood by an ML model directly. So, we will use CountVectorizer and a TF-IDF transformer to turn the tokens into numeric features, and we fit a simple Random Forest classifier on the training data.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))])
pipeline.fit(X_train, Y_train)
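Once the pipeline is fit, we can score it on the held-out data. A minimal sketch of computing a weighted F1 across all output columns (using the X_test/Y_test split assumed earlier; this is one reasonable aggregation, not necessarily the exact scoring code):

from sklearn.metrics import f1_score

Y_pred = pipeline.predict(X_test)

# Weighted F1 per category, averaged across the output columns
scores = [
    f1_score(Y_test.iloc[:, i], Y_pred[:, i], average='weighted', zero_division=0)
    for i in range(Y_test.shape[1])
]
print(f'Mean weighted F1 across categories: {sum(scores) / len(scores):.2f}')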

For evaluating our model, we will use the F1 score, since both false negatives and false positives matter to us: if we fail to predict the right category for a message, we won't be able to provide the right assistance, and if we predict the wrong category, we will waste time. The Random Forest classifier gives us an F1 score of 0.44. The main reason behind the low score is that the categories are highly imbalanced. The distribution of the categories is as follows:


Let's improve the model with a different algorithm and some hyperparameter tuning. After running a GridSearchCV to find the best parameters for the Random Forest model, we were able to increase the F1 score to 0.51. Next, we train an AdaBoost classifier, which improves the F1 score to 0.59.

# Hyperparameter reference: https://medium.com/swlh/the-hyperparameter-cheat-sheet-770f1fed32ff
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

pipeline_ada = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(
        AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1, class_weight='balanced'))
    ))
])

# The clf__estimator__ prefix reaches the AdaBoost inside the MultiOutputClassifier
parameters_ada = {
    'clf__estimator__learning_rate': [0.1, 0.3],
    'clf__estimator__n_estimators': [100, 200]
}
cv_ada = GridSearchCV(estimator=pipeline_ada, param_grid=parameters_ada,
                      cv=3, scoring='f1_weighted', verbose=3)
cv_ada.fit(X_train, Y_train)

We save this model as a pickle file so that we do not need to train it again. The code is available here.

import pickle
pickle.dump(cv_ada, open('disaster_ada_model.sav', 'wb'))
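One caveat when loading the model back: because the vectorizer references our custom tokenize function, that function must be importable in whatever process unpickles the file:

model = pickle.load(open('disaster_ada_model.sav', 'rb'))  # tokenize must be importable here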

Flask Application

We will create a train_classifier.py script with functions that transform and load the data, then build and save the model; essentially, it chains the ETL pipeline and the ML pipeline together. We will also create a folder named app containing master.html, which serves as the front end, and run.py, which runs behind it to perform the computation.
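For a sense of what run.py involves, here is a minimal sketch (the route names, port, and template variables are illustrative assumptions, not the exact code; the file names come from the description above):

# A minimal sketch of run.py; assumes the files created in the earlier steps.
# Note: tokenize must be importable here for the pickled model to load.
import pickle

import pandas as pd
from flask import Flask, render_template, request
from sqlalchemy import create_engine

app = Flask(__name__)

# Load the cleaned data and the trained model saved earlier
engine = create_engine('sqlite:///disaster.db')
df = pd.read_sql_table('disaster_response', engine)
model = pickle.load(open('disaster_ada_model.sav', 'rb'))

@app.route('/')
def index():
    # master.html renders the input form (and any data visuals)
    return render_template('master.html')

@app.route('/go')
def go():
    # Classify the user's message and map predictions back to category names
    query = request.args.get('query', '')
    labels = model.predict([query])[0]
    results = dict(zip(df.drop(columns=['id', 'message', 'genre']).columns, labels))
    return render_template('master.html', query=query, classification_result=results)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=3001)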

Demo

Conclusion

We built a full-stack, multi-output ML web application that classifies messages sent during disasters into different categories so that different disaster relief organizations can provide quick assistance. You can run this application on your own computer by following the instructions on my GitHub.


Thank you for reading!