Passionately curious about Data, Databases and Systems Complexity. Data is ubiquitous, the database universe is dichotomous (structured and unstructured), expanding and complex. Find my Database Research at SQLToolkit.co.uk . Microsoft Data Platform MVP

"The important thing is not to stop questioning. Curiosity has its own reason for existing" Einstein

Tuesday, 12 June 2018

Apache Calcite: A Foundational Framework

At ACM SIGMOD/PODS 2018 this week, Hortonworks are talking about Apache Calcite, a foundational framework for optimized query processing over heterogeneous data sources. It seems an interesting dynamic data management framework that intentionally omits some key functions: storage of data, algorithms to process data, and a repository for storing metadata.

The original goal was to improve Apache Hive along three axes: latency, scalability, and SQL support. Hive and Calcite are now more closely integrated, and the new optimizer features aim to generate better query execution plans.

It can be used for data virtualization/federation. It supports heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is an attractive choice for adoption in big-data frameworks.

Sunday, 10 June 2018

24 Hours of PASS Summit Preview 2018

Free Microsoft Data Platform training will be held on 12 June 2018, starting at 12:00 UTC. It is a great opportunity to hear some amazing sessions: 24 hours of back-to-back, one-hour content.

Topics covered in this edition include Performance Tuning, Azure Data Lake, Digital Storytelling, Advanced R, Power BI and more! Join in on sessions from Kendra Little, Melissa Coates, Rob Sewell, Brent Ozar, Dejan Sarka, Devin Knight, Mico Yuk and many more experts in their fields.

Saturday, 9 June 2018

ACM SIGMOD/PODS Conference 2018

It is that time of year again for the annual ACM SIGMOD/PODS Conference, to be held 10-15 June 2018 in Houston. The ACM SIGMOD/PODS Conference is a leading international forum for database researchers, practitioners, developers, and users to explore cutting-edge ideas and results, and to exchange techniques, tools, and experiences.

Microsoft has contributed to the SIGMOD programme, with several researchers serving on committees and taking part in workshops, research sessions, industry sessions, demo sessions, and poster sessions.

Thursday, 7 June 2018

Microsoft Graph - a useful tool

Microsoft Graph contains digital artifacts on life and work, and also ensures privacy and transparency. An overview of Microsoft Graph explains how it is a gateway to data and intelligence in Microsoft 365. 
The Microsoft Graph exposes APIs for:
  • Azure Active Directory
  • Office 365 services: SharePoint, OneDrive, Outlook/Exchange, Microsoft Teams, OneNote, Planner, and Excel
  • Enterprise Mobility and Security services: Identity Manager, Intune, Advanced Threat Analytics, and Advanced Threat Protection.
  • Windows 10 services: activities and devices
  • Education
The quick start page explains usage and provides examples.
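As a rough illustration of what calling one of these APIs looks like, here is a minimal Python sketch that builds (but does not send) a request to the Microsoft Graph v1.0 endpoint. The helper name build_graph_request and the placeholder token are my own assumptions for illustration; a real call needs an access token obtained through Azure Active Directory.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Root of the Microsoft Graph v1.0 REST endpoint.
GRAPH_ROOT = "https://graph.microsoft.com/v1.0"


def build_graph_request(resource, access_token, select=None):
    """Build (without sending) a GET request for a Graph resource.

    Hypothetical helper for illustration only; it is not part of any SDK.
    """
    url = f"{GRAPH_ROOT}/{resource.lstrip('/')}"
    if select:
        # $select limits the properties returned; keep "$" and "," readable.
        url += "?" + urlencode({"$select": ",".join(select)}, safe="$,")
    # Graph expects an OAuth 2.0 bearer token in the Authorization header.
    return Request(url, headers={"Authorization": f"Bearer {access_token}"})


# "<access-token>" is a placeholder, not a working token.
req = build_graph_request("me", "<access-token>", select=["displayName", "mail"])
print(req.full_url)
```

Passing the request to urllib.request.urlopen with a valid token would return the signed-in user's profile as JSON.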

Tuesday, 5 June 2018

Splunk Structure

The Splunk infrastructure is made up of various components:

Indexer - processes incoming machine data and stores the results in indexes for searching. Raw data is compressed and the indexes point to the data
Search Head - takes the search request and distributes it to the indexers, which search the data, then consolidates the results and displays them. Knowledge objects on the search head can be used to create additional fields and transform the data
Forwarder - consumes data and forwards it to the indexers for processing
Deployment Server - distributes content and configurations
Cluster Master - coordinates the replicating activities of the peer nodes and tells the search head where to find data
License Master - shares licenses with other nodes
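
To make the flow between the first three components concrete, here is a toy in-memory model in Python. It is only a sketch of the roles described above, not Splunk's API: the class names mirror the component names, while the round-robin forwarding and the plain substring search are simplifying assumptions of mine.

```python
from dataclasses import dataclass, field


@dataclass
class Indexer:
    """Stores incoming events and answers searches over its own data."""
    name: str
    events: list = field(default_factory=list)

    def index(self, event):
        self.events.append(event)

    def search(self, term):
        return [e for e in self.events if term in e]


class Forwarder:
    """Consumes events and hands them to the indexers (round-robin here)."""

    def __init__(self, indexers):
        self.indexers = indexers
        self._next = 0

    def forward(self, event):
        target = self.indexers[self._next % len(self.indexers)]
        target.index(event)
        self._next += 1


class SearchHead:
    """Distributes a search to every indexer and consolidates the results."""

    def __init__(self, indexers):
        self.indexers = indexers

    def search(self, term):
        results = []
        for indexer in self.indexers:
            results.extend(indexer.search(term))
        return sorted(results)


indexers = [Indexer("idx1"), Indexer("idx2")]
forwarder = Forwarder(indexers)
for line in ["10:00 ERROR disk full", "10:01 INFO login ok", "10:02 ERROR timeout"]:
    forwarder.forward(line)

hits = SearchHead(indexers).search("ERROR")
print(hits)
```

The search head never stores the events itself; it only fans the request out and merges what comes back, which is the division of labour the list above describes.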

The Folder Structure

The structure of the folders within Splunk is as follows:

Friday, 1 June 2018

Azure Cloud Collaboration Center

In May, Microsoft in Redmond shared the news of its Azure Cloud Collaboration Center. This facility is the first of its kind. The Cloud Collaboration Center space shows customers a snapshot of what is happening with their data 24/7 and enables real-time troubleshooting of issues by multiple teams simultaneously from across the organization. It combines innovation and scale to address operational issues and unexpected events in order to drive new levels of customer responsiveness, security and efficiency.

It is great to see a space for correlation of information with the possibility to pull the data up on individual workstations. I think this kind of collaboration is great news for customers.

Monday, 28 May 2018

CODEX: The Control of Data EXpediently

Complexity exists in database systems whether managing data or databases. The ability to understand this complexity and use this understanding to improve and innovate in the management of these database systems is an important step.  My PhD research investigations centred around understanding best practices and the complexity that exists around management of database systems. 

I used the analogy of a CODEX. The CODEX is a blueprint for database systems management. The acronym CODEX was selected by analogy with the revolutionary introduction of the Codex (Netz & Noel 2007, pp.69–85) in the first century AD which changed the storage medium from a roll to a Codex (book format). This brought challenges migrating the data, but significant benefits of increased speed of data access, reference and durability (of the parchment). Not all texts were migrated from rolls to Codex and those that were not migrated became defunct. Text case was changed from capitals to lowercase and minuscule copies made, resulting in further change; original majuscule manuscripts have not survived. This scholarly activity led to a revival in reading classic documents and a development of a centre of culture.

The CODEX is one of the outputs from the research, based on interpretation of the data. The components stated in the CODEX are the most prevalent components that are connected when managing database systems.

The CODEX (Control of Data EXpediently) blueprint acronym is constructed as: C for control; O for control of Operations; D for data; E for expediently; and X for unpredictable events. It is an acronym for a system or way of controlling operations and data in a rapid, efficient and accurate manner.

The five inputs into the CODEX are required for every piece of data or database management work. These five inputs are described in the paragraphs below.

C. An important step in any database system is the control system: defining the business needs and budget, and controlling the people and time factors. People are important in the management of the database system. It involves the stakeholders and the teams working together to achieve a single goal. The culture driving this collaborative venture forward will undoubtedly raise conflict, but this should be addressed with a high level of communication across all levels of management, the stakeholders, the teams, and the data and database staff. Also, the governance related to data and data quality should be controlled.

O. Control of operations of the database system is the core day to day running of management tasks, the processes and the performance of the system, orchestrating management through automated and self-managing systems where possible. Technical management needs to understand how internal and external technologies integrate. This is vital when using cloud technologies because internal managers have no control of the details. All of these operations require security to be considered to protect the data.

D. The increasing volume of data acquired today requires storage in various forms. Thus, databases or big data solutions have developed to satisfy the current demand not only for storage but also to provide information quickly and accurately. The variety of data, big or small, requires governance and has a purpose. The reporting and visualization of data is key to enhancing a business's ability to grow, adapt and understand the complexity. Data is continually changing and more of it needs to be stored to meet the demands of society. Being able to understand the data, so that it is available and useful, is a core requirement to improve and innovate.

E. Expediency is driven by the need to have efficient control over costs, speed of delivery and change. Designing database systems that are easy to manage, simple and agile, utilising reference architectures and blueprints, is key to performing expediently. To be able to proceed, the critical factors are knowledge, skills and learning, which lead to understanding and allow planning to unfold unhindered. Development can lead to fast performing applications and efficient management through automation.

X. With any system, and particularly in a diverse and ubiquitous database system, change is always happening, be it in the number and type of database platforms, new technologies, global business or the environment. Change is rapid and diverse. Using patterns and always establishing best practice will help in the management of database systems. These best practices need to be able to change rapidly as requirements change or where they are not known at the outset. Producing documentation that can be automatically created is key for accuracy and for ensuring documentation is available. Also, unpredictable events can occur, and any changes to the components must be documented and changes to all respective components made. There is continuous feedback over time.

In another blog post, I will discuss how the suggested pattern, CODEX (Control of Data EXpediently), can help improve the management of database systems. Using AI to improve management, incorporating telemetry, and systems diagramming will aid with the change to come.