This article is the third part of a four-part series that will address the following:
- What is BERT in understandable language?
- How BERT affects SEO and content marketing
- The future of BERT and machine learning
- Strategies for adapting alongside BERT’s changes (COMING SOON)
Now that BERT is out and running amok in our lives, many SEOs and content marketers are trying to redefine marketing deliverables around it.
Just see the top search results from “content marketing for BERT”:
The biggest hurdle with BERT is that content marketers, SEO analysts, and people like us still aren't sure what to do with it.
Danny says we should all chill.
This tweet by Danny Sullivan has by and large been the most-used example of how SEO should adapt to BERT.
And, honestly, I think it’s bogus.
After all, Google can't push out the biggest update in the history of search engines and expect SEOs not to feel any pull to react.
So I want to take a moment to explore briefly where BERT came from and why it has evolved to where it is now. Then we can look at the future of BERT and machine learning in SEO and begin to map out future marketing tactics.
Where did BERT come from?
BERT is a product of decades' worth of machine learning research in computer science. I dove into the making of BERT a little in our original piece.
(You can read up on the original research yourself. The article links you to Donald Michie's article, which provides insight into the early decades of machine learning. Maybe I'll get to write a history piece on this later.)
What’s important for this article is the recent trends in machine learning research. By 2017, machine learning in search engine research had evolved to be based on recurrent neural networks (RNNs). RNNs were at the self-described core of language modeling, machine translation, and question answering.
In 2017, Google introduced the Transformer, a novel neural network architecture that learns to read (in their case, to translate) through self-attention.
And in computational machine learning research, this was a breakthrough.
Rather than building on recurrent or convolutional neural networks, the Transformer dispenses with recurrence and convolutions entirely, which lets it train faster and parallelize better than previous RNN models.
Unlike RNNs, the Transformer does not need to read in sequential order, and it can carry every step of interpretation into its final output or translation. Transformers still use the standard encoder-decoder configuration, but they rely on attention mechanisms instead of recurrence.
Here is an animation of how a Transformer would work when translating.
In essence, Transformers can process each word in relation to every other word in the sentence. This differs from earlier natural language processing, which typically read a sentence in sequential order, one word at a time.
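To make that "every word in relation to every other word" idea concrete, here is a minimal sketch of the scaled dot-product self-attention at the heart of the Transformer. It uses only numpy, the toy four-word "sentence" is made up for illustration, and a real Transformer would derive separate query, key, and value projections rather than using the embeddings directly:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of word vectors.

    Every position attends to every other position at once, so each word's
    new representation is a context-weighted mix of the whole sentence --
    no sequential, one-at-a-time reading as in an RNN.
    """
    d = X.shape[-1]
    # In a real Transformer, Q, K, V come from learned projections of X;
    # here we use X directly to keep the sketch minimal.
    scores = X @ X.T / np.sqrt(d)        # similarity of every word to every word
    weights = softmax(scores, axis=-1)   # each row sums to 1: an attention distribution
    return weights @ X                   # one contextualized vector per word

# Toy "sentence" of 4 words, each a 3-dimensional embedding
X = np.random.rand(4, 3)
out = self_attention(X)
print(out.shape)  # (4, 3)
```

The key point is in `scores`: the comparison of every word against every other word happens in a single matrix multiplication, which is why Transformers parallelize so much better than sequential RNNs.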
Transformer research is where BERT stems from, and this is what Google means when it says BERT can read a sentence in context, relating each word to the others around it. Pre-Transformer models could read sentences and pull out keywords, but those keywords could be entirely unrelated to each other, so it was difficult for a machine to identify what the sentence was trying to say.
Since its introduction, the Transformer has been considered the building block of most modern natural language processing (NLP) architectures. Not only did Transformers handle language modeling, machine translation, and question answering more accurately and more quickly than RNNs, they could also handle high-quality text generation.
Cracking our queries
The core purpose of machine learning has been to get machines up to our speed. In order for machines to be functional and appropriate in our lives, they also need to replicate how humans learn.
In his groundbreaking 1968 article, Donald Michie articulated just this.
“If computers could learn from experience their usefulness would be increased. When I write a clumsy program for a contemporary computer a thousand runs on the machine do not re-educate my handiwork. On every execution, each time-wasting blemish and crudity, each needless test and redundant evaluation, is meticulously reproduced.”
Since 1968, computer programmers and scientists all over the world have been working toward this model of machine learning. A computer that can learn from its past mistakes saves us the time and effort of making corrections by hand; a machine needs to handle its own corrections and "learn from its mistakes," or we would spend all our time correcting machine errors.
BERT's ability to register the whole sentence through each word, reading in both directions and in context, and to learn from its own data allows it to learn from its "mistakes". One way of thinking about this: once BERT reads a sentence, it tells the decoding process at the end what it learned at the beginning.
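One way to visualize reading in both directions: a left-to-right language model masks out future positions so each word can only see what came before it, while BERT's bidirectional attention leaves every position visible to every other. A minimal numpy sketch (the 4-token sentence length is hypothetical):

```python
import numpy as np

n = 4  # a hypothetical 4-token sentence

# Left-to-right (causal) mask: position i may only attend to positions <= i
causal = np.tril(np.ones((n, n), dtype=bool))

# BERT-style bidirectional mask: every position attends to every position
bidirectional = np.ones((n, n), dtype=bool)

print(causal.astype(int))
# Lower-triangular: the first token never sees the last one.

# The last token can see the first in both schemes, but only the
# bidirectional mask also lets the first token see the last:
print(causal[0, n - 1], bidirectional[0, n - 1])  # False True
```

This is the structural reason BERT can tell "the decoding process at the end what it learned at the beginning" and vice versa: no position is hidden from any other.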
With the Transformer breakthrough, researchers were able to teach machines to understand language more accurately, closer to how we as humans communicate. The obstacle at that point was a lack of data: Transformers, and NLP in general, require massive amounts of it.
This is why, in 2018, Google released its NLP technique as open source. By open-sourcing, Google allowed anyone to access the technique and train their own question-answering system. Many groups have followed suit, so much of the data processing for NLP and machine learning is now collaborative (see TensorFlow and its GitHub, and OpenAI's code).
BERT was also made possible by advanced hardware. Thanks to cloud computing, BERT can operate over a practically unlimited amount of bandwidth, so to speak. Google uses Cloud TPU v3 Pods to do this.
This type of computing is referred to as AI-optimized infrastructure, the most modern version of the supercomputer. These machines are built specifically for machine learning and are optimized for architectures like the Transformer and SSD (Single Shot Detector), a key component in object detection (think autonomous driving and medical imaging).
The scale of modern machine learning pushes large tech companies to work primarily with open-source code. Since the code is available, the essentials of developing AI-driven programs are available to anyone.
While the data itself is not necessarily open-sourced, Big Data companies can use these techniques to sift more easily through large data troves.
Google, Bing, and other search engines, as well as other large data-output centers, use these techniques to get a handle on the massive amounts of data we produce (see: Apache Hadoop).
Mukherjee and Shaw (2016) report that Facebook handles over 40 million photos. Companies like Walmart must sift through billions of transactions. And then there are data monsters like CERN's Large Hadron Collider (LHC), which produces more than 30 petabytes of data per year. For reference, a petabyte is 10^15 bytes of digital information (1,000 terabytes), and CERN alone produces 30 of them annually.
Thanks to open-source AI code, companies can train their own NLP models to sift through data organized by architectures like Hadoop. Other research groups, CERN among them, are working the other way around: they open-publish their raw data, which can serve as a hub for NLP models to operate in.
While it may seem outside our scope, BERT and machine learning really speak to Big Data's ability to do remarkable things. More research communities can invest in these skills to develop better autonomous cars and to address large global issues like poverty or climate-smart agriculture.
Although I cannot say this with certainty, I would imagine that the requirements for developing AI, the sizable cloud networks, the volume of data, and the amount of testing, could open up international alliances around solving global problems. Academia is already seeing a shift like this with its emphasis on interdisciplinarity and "research collaboration," which brings more viewpoints to bear on larger issues.
In the grand scheme of things, it seems that machine learning could contribute to a more accessible, open network.
Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., & Hellerstein, J. M. (2012). Distributed GraphLab: A framework for machine learning in the cloud. arXiv preprint arXiv:1204.6078.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
Tarekegn, G. B., & Munaye, Y. Y. (2016). Big data: Security issues, challenges and future scope. International Journal of Computer Engineering and Technology, 7(4).
Johar, S. (2016). Where speech recognition is going: Conclusion and future scope. In Emotion, Affect and Personality in Speech (pp. 43-49). Springer, Cham.
Mukherjee, S., & Shaw, R. (2016). Big data: Concepts, applications, challenges and future scope. International Journal of Advanced Research in Computer and Communication Engineering, 5(2), 66-74.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210-229.
Michie, D. (1968). "Memo" functions and machine learning. Nature, 218(5136), 19-22.