Is It Necessary to Use a Specialized In-Memory Database?
When faced with performance problems in reporting, BI analysis, batch processing, and other data analysis tasks, it is natural to consider using an in-memory database.
An in-memory database keeps all data resident in memory, so no external-memory accesses (disk reads) are needed during computation. Avoiding disk I/O in this way can effectively improve data processing performance.
In-memory databases are advertised as high-performance products able to solve many performance problems in business data analysis. They are fast not only because disk I/O is eliminated, but also because they apply specialized memory optimization techniques. For example, memory supports random access and strong parallelism, and it enables optimization techniques such as pre-loading and pre-indexing that enhance computing performance.
However, these techniques are not unique to in-memory databases. All computation happens in memory anyway: the CPU can only process data that is in memory, and data stored in external memory must be read into memory first. Being general memory optimization techniques, they are adopted by many database products and computing engines, and as long as the memory optimization is done cleverly, performance catches up. In this sense, any database can be called an in-memory database when the memory is large enough to hold all the data. What the so-called "specialized in-memory databases" actually offer is optimization for in-memory computation only. They are either poor at, or simply incapable of, computations on data volumes that exceed memory capacity (external-memory computation is much harder). To highlight the advantage and downplay the weakness, vendors label these products "in-memory databases" to distinguish them from ordinary ones. In effect, what makes these "specialized" databases special is a weakened capability, not a stronger one.
Therefore, there is no such thing as a "specialized in-memory database"; there are only specialized in-memory data processing techniques. Whether a product calls itself an in-memory database does not matter. What really matters is whether the in-memory techniques it uses are effective, which can be found out by testing and comparing computing performance under a large memory.
In-Memory Processing Techniques & Computing Performance
We all know that SQL-based relational databases are still the mainstream database systems, whether they call themselves in-memory databases or not. Unfortunately, SQL cannot make full use of the characteristics of memory, leaving much room for performance improvement. Certain optimization methods, such as compression, columnar storage, indexing, parallel processing, and vectorized computing, can be well implemented based on SQL, but other techniques that would greatly help computing performance are inconvenient to implement in SQL because of the language's intrinsic limits.
SQL lacks a data type for representing records. A SELECT result is a new data table unrelated to the original table, even if it contains only one record. As a result, a lot of data is copied during computation (requiring new memory space), increasing both space usage and time consumption. With an explicit record type, a data table could be formed directly from records that hold the memory addresses of data in the original table; data would not be copied frequently or in large amounts, memory would be used effectively, and performance would be higher.
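To illustrate the idea, here is a minimal Python sketch (not SPL itself, and the data is made up): filtering by reference keeps pointers to the original records instead of copying them, so the result set and the original table share the same records in memory.

```python
# Each "record" is a dict; the filtered result holds references to the
# original dicts rather than copies, so no record data is duplicated.
orders = [
    {"id": 1, "amount": 120},
    {"id": 2, "amount": 80},
    {"id": 3, "amount": 200},
]

# Filtering collects references, not copies.
big = [r for r in orders if r["amount"] > 100]

# Both views share the same underlying records.
assert big[0] is orders[0]      # same object in memory, nothing copied
orders[0]["amount"] = 150       # an update is visible through both views
assert big[0]["amount"] == 150
```

A SQL engine without a record type has to materialize the filtered rows as fresh copies instead, paying both the memory and the copying cost.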
For multi-table joins, if the foreign key values in the fact table could be converted in advance into the memory addresses of the corresponding dimension table records (a pre-join), no association work would be needed when the records are used, and performance would be as good as that of a single-table query. Unfortunately, SQL cannot exploit this characteristic either.
SQL also has poor support for order-based computations. The language is designed on unordered sets; it offers no ordinal numbers for set members, no positioning operations, and no mechanism for referencing neighboring members, making it difficult to exploit data orderliness to implement high-efficiency algorithms. For example, if data in memory is ordered, a binary search can speed up queries, noticeably improving performance on large data volumes. Location by ordinal number is a similar and even more efficient method, which exploits the high-speed random access of memory to retrieve data directly at a specified position. Both methods are difficult to implement in SQL.
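The binary search idea can be sketched in a few lines of Python (an illustrative example with made-up data, using the standard bisect module):

```python
import bisect

# Ordered in-memory data enables O(log n) lookup instead of a full scan.
keys = [3, 7, 12, 25, 40, 58, 77, 91]   # must be kept sorted

def find(key):
    """Return the ordinal position of key, or -1 if absent."""
    i = bisect.bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else -1

assert find(25) == 3    # located in 3 comparisons instead of scanning 8 values
assert find(26) == -1   # absence is also established in O(log n)
```

Because SQL tables are unordered sets, an engine cannot assume this sortedness at the language level; it can only approximate the effect through indexes.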
Making full use of memory also requires the ability to describe data structures. The relational model (SQL) relies on two-dimensional tables to represent relationships, so it has difficulty describing and using more complex data structures (such as multilevel JSON) that are well suited to in-memory storage and have advantages in both space utilization and access efficiency. In fact, a language needs to support complex data structures to implement a real grouping operation. SQL forces an aggregation after each grouping action, but sometimes what we care about is the grouped subsets themselves (the members in each group). To achieve that, SQL has to invent group ordinal numbers using subqueries, producing cumbersome code, repeated queries, and low efficiency.
The root cause of SQL's inability to make good use of memory lies in its theory. When SQL (the relational model) was invented, computer memory was very small, so understandably the language adapts awkwardly to today's large-memory environments. Contemporary databases have implemented many engineering optimizations that improve the situation to some extent, but the optimization engine tends to give up as soon as the scenario gets even a little complicated; after all, engineering can hardly compensate for theoretical shortcomings. Moreover, the quality and application scope of an optimization engine can only be determined through strict testing and evaluation, and such tests are generally limited to scenarios that SQL handles easily rather than complex ones. This makes database product selection failure-prone and risky.
SPL Can Solve These Problems
SPL (Structured Process Language) is an open-source computing engine for processing structured and semi-structured data. To solve the SQL problems described above, SPL is designed on a brand-new model rather than SQL's relational algebra. The language supports both in-memory and external-memory computation, and it offers memory optimization methods that achieve, and even surpass, the performance of an in-memory database.
When the data volume exceeds memory capacity, SPL can switch to external-memory computation (loading data into memory in batches), and in certain scenarios the performance is nearly as good as that of a full in-memory computation, enabling high performance even with limited memory. A unique feature of SPL is that it can be embedded into an application to supply high-efficiency computing ability from within.
SPL also offers the mature engineering optimization techniques that in-memory databases use, such as the previously mentioned columnar storage, vectorized computing, pre-loading, and pre-indexing. These general engineering strategies are enough to achieve performance on a par with an in-memory database, but they do not fully exploit the characteristics of memory. Exploiting those characteristics is SPL's core strength.
As noted previously, higher performance can be achieved by exploiting the characteristics of memory and of the computations themselves when handling complex data structures or complex computations. In-memory databases cannot exploit those characteristics because of SQL's limits. SPL can.
Unlike SQL's SELECT, which copies data, increasing data volume and lowering efficiency, SPL keeps only the original records' addresses, that is, their memory addresses, without copying the records themselves, which gives it advantages in both space usage and computing efficiency. SPL can do this because it has a dedicated record type that stores references (memory addresses) to the original data.
SQL, by contrast, has no record data type. In SQL, a single record is a data table consisting of a single row; different data tables cannot share records; and a filtering operation copies records to form a new data table, with unsatisfactory space utilization and time costs.
We all know that the CPU accesses memory via addresses, so if we store the addresses of certain data beforehand, accessing them later is very fast. Take the join operation as an example. If we store the addresses of the foreign-key (dimension) table's records in the fact table, no join work (hash computation) is needed when the two tables' data is used, and performance equals that of a single-table query. This pre-association method has obvious advantages in scenarios involving data reuse and multi-table association (where the number of dimension tables is relatively large). SPL offers such a mechanism: pre-join the fact table with the multiple dimension tables (storing the memory addresses of the dimension table records in the foreign key fields), keep the result in memory, and speed up the computation through the addresses. Find more information HERE.
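The pre-join idea can be sketched in Python (a hypothetical example with made-up tables; object references stand in for memory addresses):

```python
# Dimension table keyed by id.
products = {
    101: {"id": 101, "name": "pen",  "price": 2.0},
    102: {"id": 102, "name": "book", "price": 9.5},
}

# Fact table with a foreign key column "pid".
orders = [
    {"pid": 101, "qty": 3},
    {"pid": 102, "qty": 1},
    {"pid": 101, "qty": 2},
]

# Pre-join once: replace each foreign key value with a direct reference
# to the dimension record (the analogue of a memory address).
for o in orders:
    o["pid"] = products[o["pid"]]

# Later computations simply dereference: no hashing, no lookup per row.
total = sum(o["qty"] * o["pid"]["price"] for o in orders)
assert total == 3 * 2.0 + 1 * 9.5 + 2 * 2.0   # 19.5
```

The one-time conversion pays off whenever the fact table is reused across many queries, which is exactly the data-reuse scenario described above.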
SQL is based on unordered sets, which makes many high-performance algorithms impossible to implement. To solve the problem, SPL directly offers ordered sets, which let it exploit the full potential of orderliness. In the very complex celestial-body matching computation, for example, the computation amount can be greatly reduced by first performing an initial filter with an order-based binary search, which also simplifies the subsequent computation. SQL cannot describe this computation and thus cannot exploit orderliness; in actual practice, it is over three orders of magnitude slower than SPL. Find more information HERE.
SPL can also make the most of ordinal numbers to perform high-efficiency accesses. For a query, if the key value to be queried is the target value's ordinal number in the table sequence, or the ordinal number can easily be derived from the key value, the ordinal-number-based location method completes the lookup in constant time, without any comparisons.
What's more, the association involving a large fact table (an external-memory computation) can also be implemented with ordinal numbers. When a fact table is too large to fit into memory, the address-reference method above no longer applies. In that case, we can convert the fact table's foreign key values into the positions of their corresponding records in the dimension table, a process called numberization. The association between the numberized fact table and the dimension table then uses the more efficient ordinal-number-based location to perform lookups without any comparisons, producing performance nearly as good as that of the memory-address method. In SPL, the use of ordinal numbers is an important performance optimization strategy.
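Numberization can be sketched in Python as well (illustrative data; list positions play the role of ordinal numbers):

```python
# Dimension table as an ordered list; a record's position is its ordinal.
regions = [{"name": "east"}, {"name": "west"}, {"name": "north"}]
key_to_ordinal = {r["name"]: i for i, r in enumerate(regions)}

# Numberize the fact table once: foreign key values become ordinals.
facts = [{"region": "west", "amount": 10},
         {"region": "east", "amount": 7}]
for f in facts:
    f["region"] = key_to_ordinal[f["region"]]

# The association is now a constant-time positional access per row,
# with no key comparisons at all.
joined = [(regions[f["region"]]["name"], f["amount"]) for f in facts]
assert joined == [("west", 10), ("east", 7)]
```

Unlike object references, the ordinals survive being written to disk, which is why this variant also works when the fact table lives in external memory.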
We can see that various computing scenarios are taken into account in SPL's design. It does not target only memory or only external memory; it targets both, so problems in all scenarios can be handled well. Find more information HERE.
Sometimes the computing goal has explicit requirements on order. For example, the securities industry's consecutive rise/fall computations require comparisons between neighboring values. Databases with good SQL implementations can use window functions for this type of computation, but even simple computations are hard to express in this roundabout way, and execution is slow.
To count the longest consecutive rising days of a stock, for example, SQL needs a three-layer nested query to express the computing process even with window functions, and execution efficiency is poor (database optimization strategies become useless in complicated scenarios). By contrast, SPL codes the computation conveniently thanks to its support for order-based computations (order-based grouping):
stock.sort(date).group@i(price<price[-1]).max(~.len())
SPL references the immediately previous or next record through relative positions, as price[-1] does here.
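For comparison, a rough Python analogue of the same idea (an illustrative sketch, not SPL's implementation; it counts the longest run of day-over-day rises in a price list already sorted by date):

```python
def longest_rise(prices):
    """Length of the longest run of consecutive day-over-day rises."""
    best = run = 0
    for prev, cur in zip(prices, prices[1:]):   # neighboring-value comparison
        run = run + 1 if cur > prev else 0      # extend or reset the streak
        best = max(best, run)
    return best

assert longest_rise([10, 11, 12, 11, 13, 14, 15, 16]) == 4
```

The point is that a procedural loop over ordered data expresses the neighboring-value comparison directly, which is what SQL's unordered-set model makes awkward.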
SPL's order-based computation performs particularly well on the e-commerce industry's customer churn analysis. Find detailed explanations in the article SPL computing performance test series: funnel analysis.
SPL also shows its ability to make the most of memory in describing complicated data structures such as JSON. It supports multilevel data structures directly: with its generic type, SPL allows sets as members of a sequence, that is, members of a set can themselves be sets, giving the language a natural ability to describe multilevel structures such as JSON and XML. This also enables SPL to keep grouped subsets (a set of sets) and perform further computations on each subset, avoiding the roundabout nested queries of SQL and achieving higher performance.
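The "set of sets" grouping can be sketched in Python (illustrative data; each group keeps its member records rather than collapsing to an aggregate):

```python
from collections import defaultdict

emps = [
    {"dept": "sales", "name": "Ann"},
    {"dept": "tech",  "name": "Bob"},
    {"dept": "sales", "name": "Cid"},
]

# Group while keeping the grouped subsets themselves: the result is a
# set of sets (here, a dict of lists of the original records).
groups = defaultdict(list)
for e in emps:
    groups[e["dept"]].append(e)

# A further computation on each subset, e.g. listing names per department.
names = {d: [e["name"] for e in g] for d, g in groups.items()}
assert names == {"sales": ["Ann", "Cid"], "tech": ["Bob"]}
```

In SQL the grouped members are gone after GROUP BY aggregates them, so reaching them again requires the subquery workaround described earlier.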
By offering all these techniques that fully exploit the characteristics of memory, SPL outperforms in-memory databases, achieving higher computing performance with more concise code.
Do "Specialized In-Memory Databases" Need to Exist?
Not really. Though it does not advertise itself as an in-memory database, SPL possesses in-memory computing techniques more powerful than those of in-memory databases.
SPL also provides external-memory computation to handle data exceeding memory capacity, giving it a broader range of application scenarios; combining the in-memory and external-memory abilities brings out the most power. A typical example of this combination is the previously mentioned funnel analysis. A product that claims to be an in-memory database usually lacks external-memory computation ability, which greatly limits its application scope; SPL's scope is much wider.
SPL can also work as an embedded computing engine, integrating with an application through its jars. Deployed together with the application, it offers in-memory-database-grade high-performance computation directly from within the application, and it supports the cooperative use of the in-memory and external-memory abilities.
With its high-efficiency in-memory computing ability, SPL is fully qualified to replace a "specialized in-memory database". Together with its external-memory ability and flexible integration, SPL covers a broader application scope. When examining a product that claims to supply high-performance computation, it is not important whether it is called an in-memory database; it is a good product as long as its in-memory computing techniques are effective and its application scope is wide. By this standard, SPL is an ideal choice.
You can find more about SPL:
This post is republished with permission from SPL! You can find the original post on Medium and SPL's official website