Understanding the competition brief, the Request for Proposal (RFP) or any other demand from your client is essential to providing the best service. On the other hand, understanding your competitors' offers helps your business bring value to the market through its products and services. We need to understand both entities, clients and competitors, to respond to the business ecosystem, for example through an effective marketing campaign or the right arguments to win a bid.
In this article, we explain the benefits of using Machine Learning (ML) to extract insights from stadium design descriptions written by clients and architectural practices.
Natural Language Processing
Natural Language Processing is a subfield of machine learning that processes and analyzes natural language expressed in audio, text or other formats. Text typically has no defined format or structure. The sentences in a paragraph express ideas, and the same idea can be expressed in different ways:
- I have a train to catch from Waterloo station.
- My train departs from Waterloo station.
- From Waterloo station, my train departs.
- I will pick the train from Waterloo.
In the example above, the same idea is expressed in different ways; each sentence is an approximation of a concept. This is called unstructured data, and it is usually categorized as qualitative data: concepts and characteristics. Since there is no defined model, it is difficult to deconstruct. For example, if we pick the third word from each sentence we get: a, departs, station, pick. That information carries no meaning and cannot be organized in a database.
Most of the data we generate nowadays is unstructured, and it is difficult to extract insight from it. There are several techniques to process and analyze structured data, yet comparatively few to analyze natural language data. Structured data gives you general information about clients and competitors. In contrast, unstructured data can give you a much deeper understanding, because you can measure the client's sentiment toward a specific idea expressed in a competition brief. You can also understand what your competitors say about a specific subject.
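The third-word experiment is easy to reproduce. The snippet below is only an illustrative sketch: it uses the four example sentences listed earlier, and the helper name `third_word` is invented for this example.

```python
import string

sentences = [
    "I have a train to catch from Waterloo station.",
    "My train departs from Waterloo station.",
    "From Waterloo station, my train departs.",
    "I will pick the train from Waterloo.",
]


def third_word(sentence):
    """Return the third whitespace-separated word, without punctuation."""
    word = sentence.split()[2]
    return word.strip(string.punctuation)


# Position-based extraction works for a table column, not for free text.
print([third_word(s) for s in sentences])
# prints ['a', 'departs', 'station', 'pick']
```

The same position gives an article, a verb and two unrelated words, which is why free text cannot be queried like a table.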
The data used in the project consists of stadium design descriptions from two different points of view:
- Architectural design practice: Descriptions ranging from small stadiums with a capacity of 10K to 60K-seat stadiums by international award-winning architectural practices.
- Owners and operators: Stadium owners and company operators describing stadiums.
Bag of words
Text preprocessing is the first step in making the data useful for an ML model, and it consists of cleaning the text. Diacritics (glyphs added to a letter) and stop words are removed from both data sets. Stop words such as "the" and "or" are very common in the text but carry little meaning.
The encoding technique transforms the words in the text corpus into vectors, and there are different ways to do it. In the first part of the project, BagOfWord is used to return the occurrences of each word in both data sets; the Laga library was used for this task. In the second part of the project, TF-IDF was implemented through the scikit-learn library.
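As a rough illustration of this cleaning step, the sketch below removes diacritics and stop words using only the Python standard library. It is not the code used in the project: the function names, the very short stop-word list and the sample sentence are all assumptions made for this example.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "or", "a", "an", "and", "of", "is", "in", "to"}


def strip_diacritics(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text):
    """Lowercase, remove diacritics, tokenize and drop stop words."""
    cleaned = strip_diacritics(text.lower())
    tokens = re.findall(r"[a-z0-9]+", cleaned)
    return [t for t in tokens if t not in STOP_WORDS]


# Made-up sample sentence.
print(preprocess("The Estádio is the heart of the city"))
# prints ['estadio', 'heart', 'city']
```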
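To make the two encodings concrete, here is a plain-Python sketch. It does not reproduce the Laga or scikit-learn APIs: the two tiny documents are made up, and `tf_idf` is a hypothetical helper that applies the textbook formula (term frequency times the logarithm of N over document frequency). scikit-learn's `TfidfVectorizer` uses a smoothed variant with normalization by default, so its numbers would differ.

```python
import math
from collections import Counter

# Two made-up, already-cleaned documents standing in for the two data sets.
docs = {
    "client": "vip lounge business vip services stadium".split(),
    "architect": "roof facade design stadium public space".split(),
}

# Bag of words: how many times each word occurs in each document.
bags = {name: Counter(tokens) for name, tokens in docs.items()}


def tf_idf(word, name):
    """Textbook TF-IDF; assumes the word occurs in at least one document."""
    tf = bags[name][word] / len(docs[name])
    df = sum(1 for bag in bags.values() if word in bag)
    return tf * math.log(len(docs) / df)


print(bags["client"].most_common(1))  # prints [('vip', 2)]
print(tf_idf("stadium", "client"))    # prints 0.0, the word is in every document
print(tf_idf("vip", "client") > 0)    # prints True
```

The bag of words only counts occurrences, while TF-IDF pushes down words that appear everywhere, such as "stadium" in this toy corpus.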
There is a lot of information we can extract by comparing these 2 distributions. We will only mention 2:
In the client's distribution, the word VIP is 3rd in the list by number of occurrences in the document. In contrast, the architectural practice data set places the word VIP in position 36 by number of occurrences. This means the clients' descriptions give much more importance to the word VIP than those of the architectural practices.
The client's data set distribution is stepped, centered on the words business, infrastructure, construction and services, among others. On the other hand, the architectural practice data set distribution is much "flatter": more words have the same "weight" in the text corpus. For example, the word revenue has the same weight as the word new.
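The position of a word in each distribution can be read directly from the word counts. The sketch below shows the mechanics only: the counts are invented, so the resulting ranks are not the 3rd and 36th positions reported for the real data, and `rank_of` is a hypothetical helper.

```python
from collections import Counter


def rank_of(word, counts):
    """1-based position of `word` when words are sorted by occurrences."""
    ordered = [w for w, _ in counts.most_common()]
    return ordered.index(word) + 1


# Made-up counts, only to show how the comparison works.
client = Counter({"business": 9, "services": 7, "vip": 6, "roof": 1})
practice = Counter({"design": 8, "public": 7, "roof": 5, "vip": 2})

print(rank_of("vip", client), rank_of("vip", practice))  # prints 3 4
```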
K-Means Clustering - Unsupervised Learning.
In this step, we process the architectural design data set with KMeans to group the text into different concepts and reveal hidden patterns. Clustering is a form of unsupervised learning that consists of revealing undetected patterns in data points with no labels. This means the sentences are not labeled, for example: sentence 1 is about "public space", sentence 2 is about "design", etc.
For this project, we trained a KMeans model on the stadium descriptions to understand the top N concepts about stadiums and design. There are different ways to calculate how many clusters are needed. Several tests were conducted, and 3 clusters was considered the best option.
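One common way to compare cluster counts is the "elbow" heuristic: run K-Means for several values of k and look at where the inertia (the sum of squared distances to the nearest centroid) stops dropping sharply. The article does not say which tests were run, so the following is only a sketch of that heuristic. It uses a minimal plain-Python version of Lloyd's algorithm on made-up 2-D points, with a simplified deterministic initialization; a library implementation such as scikit-learn's `KMeans` would normally be used on the real TF-IDF vectors.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, iterations=10):
    """Minimal Lloyd's algorithm; returns (centroids, labels, inertia)."""
    # Simplified, deterministic start: k evenly spaced points from the list.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    labels = assign(points, centroids)
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, inertia


# Made-up points forming three visible groups.
points = [
    (1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
    (8.0, 8.0), (8.0, 9.0), (9.0, 8.0),
    (1.0, 8.0), (1.0, 9.0), (2.0, 8.0),
]

inertias = {k: kmeans(points, k)[2] for k in range(1, 5)}
for k, value in inertias.items():
    print(k, round(value, 2))
# The inertia falls from about 200 (k=1) to about 4 (k=3), then barely moves.
```

On this toy data the curve flattens after k = 3, which is the kind of signal that would support choosing 3 clusters.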
Load the text and print the first sentence in the data set.
Clean the data set: tokenization, vectorization and stemming.
Sum the words to find the top sentences for the concepts stadium and design, that is, the sentences in the data set that encapsulate the most about these two concepts:
42: The design of the stadium is unique its facade and roof are integrated and formed by petalshaped modules designed to be one of the highest stages on one side giving the impression of sand dunes which are common in the region moving the design also allows for greater ventilation and light inside the stadium
12: The stadium design incorporates the most important experience of any football stadium in the world with large resting spaces designed to accommodate the greatest possible hobby inspired by the natural beauty of monterrey and the rich history of the city the stadium will become a real destination to watch a game while redefining the design of the stadiums in mexico through creative urban planning understanding global trends that makes up the fan experience and a sophisticated and advanced vision of the future of sports estadio bbva bancomer will transform expectations across the region
39: Bbva compass stadium is designed to be the core of houston downtown east neighborhood redevelopment plan with a capacity of seats its main use is as a football stadium however it can also accommodate lacrosse rugby or concerts with the challenge of generating an architectural icon for the area the stadium had to be designed within the constraints of a very tight budget to bring his vast experience in european football stadiums
72: Konya stadium is designed with a harmonization approach between cultural codes and contemporary structure
113: The new louis armstrong stadium located inside the billie jean king national tennis center features seats and an innovative design that encourages airflow through the stadium and prevents rain from reaching spectators and the court making it the world first tennis stadium with natural ventilation and retractable roof
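A ranking like the one above can be approximated by counting how often the concept words occur in each sentence. The sketch below is an assumption about how such a score could be computed, not the project's actual code: `concept_score` is a hypothetical helper, prefix matching stands in for stemming, and the three sentences are abridged or made up, with ids that only mimic the numbering of the data set.

```python
CONCEPTS = ("stadium", "design")


def concept_score(sentence):
    """Number of tokens that start with one of the concept words."""
    tokens = sentence.lower().split()
    return sum(1 for t in tokens if t.startswith(CONCEPTS))


# Abridged or made-up sentences keyed by an illustrative id.
sentences = {
    72: "konya stadium is designed with a harmonization approach",
    7: "the roof protects spectators from rain",
    42: "the design of the stadium is unique and the design allows light inside the stadium",
}

ranked = sorted(sentences, key=lambda i: concept_score(sentences[i]), reverse=True)
print(ranked)  # prints [42, 72, 7]
```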
This is the ML model and the training process. The second "cell" prints the top 10 words for each cluster. It is arguable that these clusters are similar. For example, the word "design" appears in every cluster and "stadium" in two of them, surrounded by different concepts.
- Cluster 0: (design, public, project, stand, sport, area, architectur, build, use, allow)
- Cluster 1: (structur, stadium, architectur, build, allow, project, access, citi, also, design)
- Cluster 2: (stadium, citi, design, build, area, also, project, access, space, allow)
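Lists like these are usually obtained by sorting each cluster centroid by weight and reading off the matching vocabulary terms. The sketch below shows that step in plain Python with a made-up vocabulary and made-up centroid weights; `top_terms` is a hypothetical helper. With scikit-learn, the centroid vectors would come from the fitted model's `cluster_centers_` attribute.

```python
def top_terms(centroid, vocabulary, n=3):
    """Return the n vocabulary terms with the largest centroid weights."""
    ranked = sorted(zip(centroid, vocabulary), reverse=True)
    return [term for _, term in ranked[:n]]


# Made-up vocabulary and centroid weights, only to show the mechanics.
vocabulary = ["design", "public", "stadium", "structur", "citi"]
centroids = [
    [0.9, 0.7, 0.1, 0.0, 0.2],
    [0.3, 0.0, 0.6, 0.8, 0.1],
]

for index, centroid in enumerate(centroids):
    print(f"Cluster {index}:", top_terms(centroid, vocabulary))
# prints Cluster 0: ['design', 'public', 'citi']
# prints Cluster 1: ['structur', 'stadium', 'design']
```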
From my point of view, cluster 0 is more about public-space, cluster 1 is more about structure-build and cluster 2 is more about stadium-design.
The plot above shows the different clusters by color: red represents the public-space concept, green represents the structure-build concept and blue represents the stadium-design concept.
The red and blue points are distributed along the X axis and below 0.2 on the Y axis. These 2 clusters are similar in concept. The green points (structure-build) lie above 0.2 on the Y axis, which makes sense because they represent a concept that is more distinct from the other 2 clusters.
Architects typically don't write as much about structure as about design or public space. This helps explain the difference in the number of green points compared to the red (public-space) and blue (stadium-design) ones.
The differences between the architectural design practices are not reflected in the clusters. Although the data set is composed of descriptions from 14 different architectural design practices with different experience, education and clients, this variance does not show up in the clusters. One possible reason is that 2 of the 3 clusters cover similar concepts. In addition, the bag-of-words distribution shows a flat curve across the many different concepts involved in stadium design. Architects who design stadiums write more about design and public space and leave out other concepts, such as sustainability or technology.
"Public Space", "Area" and "Project" are the most relevant and interesting concepts for the different architectural design practices, regardless of the size of the stadium or its location.
If you found this article interesting, please leave a comment.