Unnatural Language Processing

As a data scientist, I work with data in two forms: structured and unstructured. Structured data is straightforward: columns represent features and rows represent observations. Unstructured data is less so; it can be anything from video to audio to images to text. Before an algorithm can analyze it, unstructured data must be converted into a structured format. For text, there have been many approaches to this problem, and one of the main challenges is the fluidity and variability of text data, especially between different authors. The field of data science that most often deals with converting text into structured data is called natural language processing, or NLP for short. Some of the greatest successes in NLP have come from powerful transformer models like BERT and GPT-2. On a recent project, however, I had to handle text that was not written by a human; it came from computer logs. That experience led me to this question: how does NLP factor into the processing of computer logs?

The first thing to examine is the importance of context.

LSTM-based models can learn the sequence of words on some level, and transformers look at the whole sentence to find context, but this is generally unnecessary when the text is so rigidly structured and formulaic. A classic example of log text is “FILE NOT FOUND”. By contrast, “FILE FOUND” is not a message one usually sees in a log, so word order and negation carry little extra information. This reduces the variance in the data, which allows for simpler methods. A complex transformer-based embedding system is not necessarily going to give better results than a simple bag-of-words or TF-IDF model on this kind of data. Additionally, with such a limited vocabulary compared to human-generated language, steps like stemming and lemmatization are much less necessary. Vectorization attempts to map the features of the text into a concept space that can be represented with numbers; the concept space of machine-generated text is much smaller than that of human language, so a sophisticated approach is not necessarily required. Human language is constantly changing and flowing, like water, whereas machine-generated text is rigid and unyielding, like ice.
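
To make the vectorization step concrete, here is a minimal sketch of a bag-of-words and TF-IDF representation in plain Python. The log lines are hypothetical and invented purely for illustration, and the helper names (tokenize, bag_of_words, idf, tfidf) are my own; the smoothed IDF formula is one common variant, not the only one.

```python
import math
from collections import Counter

# Hypothetical log lines, invented for illustration only.
logs = [
    "FILE NOT FOUND",
    "FILE NOT FOUND",
    "CONNECTION TIMED OUT",
    "PERMISSION DENIED",
]

def tokenize(line):
    # Lowercasing and whitespace splitting; no stemming or lemmatization.
    return line.lower().split()

# Vocabulary: one column per distinct token, in sorted order.
vocab = sorted({tok for line in logs for tok in tokenize(line)})

def bag_of_words(line):
    # Count how often each vocabulary token occurs in the line.
    counts = Counter(tokenize(line))
    return [counts[tok] for tok in vocab]

def idf(token):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    df = sum(1 for line in logs if token in tokenize(line))
    return math.log((1 + len(logs)) / (1 + df)) + 1

def tfidf(line):
    # Weight each raw count by how rare the token is across all logs.
    return [count * idf(tok) for count, tok in zip(bag_of_words(line), vocab)]

print(vocab)
print(bag_of_words("FILE NOT FOUND"))
```

With only four distinct messages the vocabulary has eight tokens, which gives a sense of how small the concept space of log text can be compared to human writing.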

As mentioned before, the data in this situation has less variance than a traditional NLP problem. With less variance in the data, the model does not need to be complex to produce good, generalizable results. Neural networks of various architectures are very much in vogue at the moment; LSTMs were popular for quite some time until the transformer architecture was introduced. LSTMs were effective because they could learn sequences of words, but transformers outperformed these recurrent models by looking at the whole sentence and capturing the context of each word via the multi-head attention mechanism. These models are very complex, and they need to be in order to capture the intricacies of human language.

However, when the language is created not by humans but by a machine following a strict set of rules, as is the case with computer logs, the complexity of the model can be dramatically lowered. On a project that involved exactly this situation, I was able to evaluate models ranging from simple statistical methods all the way to complex deep learning solutions. Ultimately, the best results (by a whopping 30 percentage points, at that) came from the simplest model: a multinomial naïve Bayes classifier. This solution outperformed logistic regression, support vector machines, random forests, convolutional neural networks, and a transformer model. Logically, this outcome makes sense: a naïve Bayes model estimates the conditional probability of each word given each outcome. A log containing the word “FILE” will probably correspond to a “FILE NOT FOUND” error, so the word counts alone reveal the relationship between the words and the outcome. With human language, a concept space relates similar words to each other in an attempt to determine intent. With machine-generated computer logs, the relationship is straightforward because of the limitations on the text. The two problems are hardly the same.
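
The conditional-probability reasoning above can be sketched in a few lines of plain Python. This is not the project's actual code or data: the labelled log lines, the label names, and the small MultinomialNB class are all hypothetical and written only to illustrate how a multinomial naïve Bayes classifier scores a log line, assuming Laplace smoothing with alpha = 1.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled log lines, invented for illustration only.
train = [
    ("FILE NOT FOUND", "missing_file"),
    ("FILE NOT FOUND IN DIRECTORY", "missing_file"),
    ("CONNECTION TIMED OUT", "network"),
    ("CONNECTION REFUSED", "network"),
    ("PERMISSION DENIED", "access"),
]

def tokenize(line):
    return line.lower().split()

class MultinomialNB:
    """Minimal multinomial naive Bayes with Laplace smoothing (a sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength; 1.0 is an assumed default

    def fit(self, examples):
        self.total = len(examples)
        self.class_counts = Counter(label for _, label in examples)
        self.token_counts = defaultdict(Counter)
        for line, label in examples:
            self.token_counts[label].update(tokenize(line))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def log_scores(self, line):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(line) if t in self.vocab]
        scores = {}
        for label, n in self.class_counts.items():
            # log P(class) + sum over tokens of log P(token | class)
            score = math.log(n / self.total)
            denom = sum(self.token_counts[label].values()) + self.alpha * len(self.vocab)
            for t in tokens:
                score += math.log((self.token_counts[label][t] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, line):
        scores = self.log_scores(line)
        return max(scores, key=scores.get)

model = MultinomialNB().fit(train)
print(model.predict("FILE MISSING"))  # prints: missing_file
```

Even though “MISSING” never appears in the training lines, the single word “FILE” is enough to push the prediction toward the file-related label, which is the kind of direct word-to-outcome relationship that rigid log text makes possible.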

This raises the question: can this still be called NLP?

On one hand, the problem incorporated several hallmarks of NLP. The text came into the system in an unstructured format and was then tokenized and converted to a structured form before a machine learning algorithm was applied for analysis. It may not fit neatly under the umbrella of NLP, but the tricks, concepts, and tools do still apply. Additionally, even though the machine produced the text, a human was ultimately behind that machine, programming it to produce that text. On the other hand, while this situation involves processing language, that language does not qualify as natural. A human was behind it at some point, but the text follows such a rigid and specific set of rules that it may as well already be structured. Ultimately, the question breaks down into two parts.

The first part is: how does one define NLP? Is NLP the process of converting unstructured text data into a structured format so that algorithms can analyze it? Or does it necessitate a human element? The second part is: does this situation do enough to meet the criteria for NLP? If the criteria involve the human element, then this situation would not qualify. Simply put, a human would not speak in computer log text. The log text is nothing like human language, apart from using some English words. Furthermore, what minimal human component there is does not by itself make this situation NLP; the machine’s ability to generate text is limited in a distinctly nonhuman way. The machine is rule-based and restricted, whereas a human’s capacity for language generation is limitless in variation.

In conclusion

This situation is not truly NLP. It borrows some tactics and tools from the subset of data science that is NLP, but the context is drastically different. Real-world NLP situations involve semantics, context, and variability. Computer logs have minimal semantics, do not need contextualization, and repeat the same text over and over again. Furthermore, the algorithm that ended up solving the problem is itself strong evidence that this situation is not NLP: the multinomial naïve Bayes classifier was effective only because of the limitations and restrictions on the text. The increasing complexity of NLP models today is a statement on how complex natural language really is; the data is extremely high in variance, and a simple statistical model would not be able to capture that variance. Had the problem involved text generated by a transformer model such as GPT-2, the text would have been much more human-like, and the problem significantly closer to NLP, because of the increase in variance and the decrease in rigidity. Ultimately, while this situation was NLP-adjacent, it was missing the natural component of natural language processing.