Structured, Unstructured and Semi-Structured Data

Big Data vs. Relational Data

The advent of “Big Data” is relatively new and is loosely defined as data that contains greater variety, arriving in increasing volumes and with more velocity. This is also known as the Three Vs. Greater than what? Well, greater than the data that would traditional be handled by a traditional Relational Database Management System (RDBMS).

The Three V’s

Volume – Every day, global data volume is increasing exponentially. There are many cultural, scientific and technological reasons for this including the invention and proliferation of smart phones, wearable technology, IoT devices, cloud computing, machine learning and artificial intelligence.

Velocity – The rate at which data is received and processed. Velocity has less to do with the aforementioned exponential growth of data being stored and more to do with real time streaming of that data and the need to process said data in near real time. Traditional ETL pipelines that operated on daily or even hourly batch processing just aren’t enough and so new solutions that could derive meaningful insights from data sets as they were coming in were necessary.

Variety – The increasingly varied types of data that were being processed. Constraining usable data to a predefined schema (structured data) had and still does have its advantages and in a perfect world all data would automagically be this way. But in the real world, big data solutions offer flexibility to process data much more quickly and in new ways that never would have been possible with traditional RDMS structures. These unstructured and semi-structured data types, such as text, audio, and video, require additional preprocessing to derive meaning and support metadata.

Data Structures

In the context of processing, storage and analysis, all data that exists can be categorized as either structured, unstructured or semi-structured.

Structured Data

Sometimes referred to as quantitative data. This is how all data in the enterprise used to be stored at scale. Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository that is typically a database. They have relational keys and can easily be mapped into pre-designed fields.

Unstructured Data

Unstructured data is a data which is not organized in a predefined manner or does not have a predefined data model, thus it is not a good fit for a mainstream relational database. It’s sometimes referred to qualitative data — you can derive meaning from it, but it can also be incredibly ambiguous and difficult to parse.

Semi-Structured Data

Data that does not reside in a relational database but that has some organizational properties that make it easier to analyze. This data probably is not as strictly typed as structured data but does enforce some rules such as hierarchy and nesting.

BASE and CAP

BASE Model

NoSQL (actually short for “not only” SQL) often times rely on a softer transaction model than ACID abbreviated as BASE:

  • Basically Available
  • Soft State
  • Eventually Consistent

Basically Available

This has to do with the perceived availability of the data. The idea is for data to always be available even if it is not in a consistent or hardened state. There are no isolation levels. One transaction may access a file that is being actively modified by another.

Soft State

Values stored in the database may change because of the eventual consistency model.

Eventually Consistent

Data and changes to said data are replicated across nodes and eventually become consistent. For example, in Hadoop, when a file is written to the HDFS, the replicas of the data blocks are created in different data nodes after the original data blocks have been written. For the short period before the blocks are replicated, the state of the file system isn’t consistent.

As tongue in cheek as BASE is, it’s a bit misleading. It is not meant to be the polar opposite of ACID because most NoSQL implementations do not abandon all ACID principles. Soft State and Eventually Consistent are also so closely related that their distinction serves no real purpose.

The point though, is that a BASE model uses a more relaxed approach to consistency for the sake of being able to scale horizontally much more easily and provide a high degree of availability.

This model is very popular with social media platforms and search engines where “some (maybe partial) results now” is much more important than “all results no matter how long it takes”.

CAP Theorem

Also known as Brewer’s theorem, named after computer scientist Eric Brewer, proposes that it is only possible for a distributed data store to provide two out of the three following guarantees:

Consistency: Every read receives the most recent write or an error

Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write

Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes