In this blog post, we are going to develop an SMS spam detector using logistic regression and PySpark. We will predict whether an SMS text is spam or not. Brace yourselves, the spam is coming!

Dataset

The text file can be downloaded from here. This is what our dataset looks like:

We are using PySpark for distributed computing, and we will create a machine learning pipeline to automate the workflow. The “type” column contains categorical data, so the first step is to convert its contents to a numeric label. We encode it because logistic regression cannot operate on categorical data.

sms_spam_df.createOrReplaceTempView('temp')
# Encode the label: 1.0 for spam, 0.0 for ham
sms_spam2_df = spark.sql('select case type when "spam" then 1.0 else 0.0 end as type, text from temp')
sms_spam2_df.show()

Now, we will create a pipeline that combines a Tokenizer, a CountVectorizer, and an IDF estimator to compute the TF-IDF vector of each SMS.


from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF

tokenizer = Tokenizer(inputCol="text", outputCol="words")             # split each SMS into tokens
countvectorizer = CountVectorizer(inputCol="words", outputCol="tf")   # term frequencies
idf = IDF(inputCol="tf", outputCol="tfidf")                           # reweight by inverse document frequency
tfidf_pipeline = Pipeline(stages=[tokenizer, countvectorizer, idf]).fit(sms_spam2_df)
  • Tokenizer: It splits each SMS into individual words (tokens).
  • CountVectorizer: It counts the number of times a token shows up in a document and uses this count as its weight (the term frequency).
  • IDF: TF-IDF stands for “term frequency-inverse document frequency”, meaning the weight assigned to each token not only depends on its frequency in a document but is also scaled down for terms that appear in many documents across the corpus; a quick look at these columns follows below.
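To sanity-check these stages, we can peek at the columns produced by the fitted pipeline (a minimal sketch; the column names follow the pipeline definition above):

# Inspect the tokens, term-frequency vectors, and TF-IDF vectors for a few messages
tfidf_pipeline.transform(sms_spam2_df).select('words', 'tf', 'tfidf').show(3, truncate=False)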

Now that we have our messages in TF-IDF form, let us create ML pipelines where the first stage is the tfidf_pipeline created above and the second stage is a LogisticRegression model, trying different regularization parameters (𝜆) and elastic net mixtures (𝛼).

Compare Models

from pyspark.ml import classification

# 60% training, 30% validation, 10% testing
training_df, validation_df, testing_df = sms_spam2_df.randomSplit([0.6, 0.3, 0.1], seed=0)
# Three candidate models with different regularization strengths and elastic net mixtures
lr1 = classification.LogisticRegression(regParam=0.0, elasticNetParam=0.0, labelCol='type', featuresCol='tfidf')
lr_pipeline1 = Pipeline(stages=[tfidf_pipeline, lr1]).fit(training_df)
lr2 = classification.LogisticRegression(regParam=0.02, elasticNetParam=0.2, labelCol='type', featuresCol='tfidf')
lr_pipeline2 = Pipeline(stages=[tfidf_pipeline, lr2]).fit(training_df)
lr3 = classification.LogisticRegression(regParam=0.1, elasticNetParam=0.4, labelCol='type', featuresCol='tfidf')
lr_pipeline3 = Pipeline(stages=[tfidf_pipeline, lr3]).fit(training_df)

We hold out a separate validation set because it lets us compare the models on data they were not trained on, which gives a more reliable picture of each model's performance.

from pyspark.ml import evaluation

# Area under the ROC curve on the validation set for each candidate model
evaluator = evaluation.BinaryClassificationEvaluator(labelCol='type')
AUC1 = evaluator.evaluate(lr_pipeline1.transform(validation_df))
print("Model 1 AUC: ", AUC1)
AUC2 = evaluator.evaluate(lr_pipeline2.transform(validation_df))
print("Model 2 AUC: ", AUC2)
AUC3 = evaluator.evaluate(lr_pipeline3.transform(validation_df))
print("Model 3 AUC: ", AUC3)

We see that model 2 (lr_pipeline2, with regParam=0.02 and elasticNetParam=0.2) performs best on the validation data, so we apply this pipeline to our test data and measure its AUC there.

AUC_best = evaluator.evaluate(lr_pipeline2.transform(testing_df))
print("Best model (lr_pipeline2) AUC on test data: ", AUC_best)

Inference

Now we will use the pipeline fitted above (lr_pipeline2) to create pandas data frames that contain the most negative words and the most positive words.
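One way to do this (a minimal sketch, assuming the pipeline structure defined above, where the CountVectorizer is the second stage of the TF-IDF pipeline and the logistic regression model is the last stage) is to pair the CountVectorizer vocabulary with the logistic regression coefficients:

import pandas as pd

# Vocabulary from the fitted CountVectorizer (second stage of the TF-IDF pipeline)
vocabulary = lr_pipeline2.stages[0].stages[1].vocabulary
# One coefficient per vocabulary term from the fitted logistic regression (last stage)
weights = lr_pipeline2.stages[-1].coefficients.toArray()
coeffs_df = pd.DataFrame({'word': vocabulary, 'weight': weights})

# Most negative words (most indicative of ham) and most positive words (most indicative of spam)
most_negative_df = coeffs_df.sort_values('weight').head(20)
most_positive_df = coeffs_df.sort_values('weight', ascending=False).head(20)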

Conclusion

We converted our text into tokens and TF-IDF vectors, played around with the parameters of logistic regression, and evaluated our models using the AUC metric on the validation data. Finally, based on the performance on the validation data, we selected the best model, applied it to the test data, and measured its performance there. Thus, we developed a spam detector using regularized logistic regression.

Can we improve the performance of our model?

The performance of the model can be improved by feature engineering. Typical spam messages contain words in upper case, so we create a data frame sms_spam3_df with a new column has_uppercase, which contains 1 if the first sequence of uppercase letters in the message is at least 3 characters long and 0 otherwise.

from pyspark.sql import functions as fn
from pyspark.sql.functions import regexp_extract, col

# Extract the first run of 3 or more uppercase letters (empty string if none)
sms_spam3_df = sms_spam2_df.select('type', 'text', regexp_extract(col('text'), '([A-Z]{3,})', 1).alias('has_uppercase'))
# Turn the extracted string into a 0/1 flag
sms_spam3_df = sms_spam3_df.select('type', 'text', fn.when(fn.length(fn.col('has_uppercase')) >= 3, 1).otherwise(0).alias('has_uppercase'))

Let’s see what our data frame looks like:
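A minimal way to inspect it (assuming the columns created above):

sms_spam3_df.show(5, truncate=False)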

from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF, VectorAssembler

tokenizer_df3 = Tokenizer(inputCol="text", outputCol="words")
countvectorizer_df3 = CountVectorizer(inputCol="words", outputCol="tf")
idf_3 = IDF(inputCol="tf", outputCol="tfidf")
# Merge the has_uppercase flag with the TF-IDF vector into a single features column
assembler = VectorAssembler(inputCols=["has_uppercase", "tfidf"], outputCol="features")
tfidf_pipeline_upper = Pipeline(stages=[tokenizer_df3, countvectorizer_df3, idf_3, assembler]).fit(sms_spam3_df)
df_upper = tfidf_pipeline_upper.transform(sms_spam3_df)
# !pip install git+https://github.com/daniel-acuna/pyspark_pipes.git
from pyspark.ml import feature, classification
from pyspark_pipes import pipe
# Assemble has_uppercase and tfidf, scale with MaxAbsScaler, and fit a regularized logistic regression
scaled_model = pipe(feature.VectorAssembler(inputCols=['has_uppercase', 'tfidf']), feature.MaxAbsScaler(), classification.LogisticRegression(regParam=0.2, elasticNetParam=0.1, labelCol='type'))
scaled_model_fitted = scaled_model.fit(df_upper)

Now that we have the two columns text and has_uppercase, we have to tokenize and compute the TF-IDF of the text and then merge it with the has_uppercase column using a VectorAssembler. The pipeline above merges the two columns, performs feature scaling using MaxAbsScaler, and fits a logistic regression model (regularization parameter 𝜆=0.2 and elastic net mixture 𝛼=0.1) on the resulting data frame.
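To check whether the extra feature actually pays off, one could evaluate the scaled model the same way as before; this is a sketch under the assumption of the same 60/30/10 split and the evaluator defined earlier:

# Hypothetical check: refit the scaled pipeline on a training split and evaluate on validation data
train_up_df, valid_up_df, test_up_df = df_upper.randomSplit([0.6, 0.3, 0.1], seed=0)
scaled_fitted_train = scaled_model.fit(train_up_df)
AUC_upper = evaluator.evaluate(scaled_fitted_train.transform(valid_up_df))
print("Scaled model with has_uppercase, validation AUC: ", AUC_upper)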

# Coefficient of the has_uppercase feature (first entry of the assembled feature vector)
my_coeff = scaled_model_fitted.stages[-1].coefficients
has_uppercase_coeff = my_coeff.toArray()[0]
print('The has_uppercase feature is positively related to an SMS being spam, with a coefficient of:', has_uppercase_coeff)

We fetch the coefficient of the has_uppercase feature from the pipeline, and it comes out to be 0.9289. Thus, has_uppercase is positively related to an SMS being spam.

What is the ratio of the coefficient of has_uppercase to the biggest positive tfidf coefficient?

# Exclude index 0 (has_uppercase) so we only look at the tfidf coefficients
max_coeff = my_coeff.toArray()[1:].max()
my_ratio = has_uppercase_coeff / max_coeff
print('The ratio of the coefficient of has_uppercase to the biggest positive tfidf coefficient is:', my_ratio)

The biggest positive TF-IDF coefficient comes out to be 2.01, so the ratio of the coefficient of has_uppercase to the biggest positive TF-IDF coefficient is 0.46.


Thank you for reading! Feedback is highly appreciated.