Sundarrajk's Weblog


See http://architects.dzone.com/articles/every-programmer-should-know for a comparison of the latencies involved in accessing different parts of a system. These numbers should give an idea of what strategy to adopt to get better performance.

But even after looking at these latencies, people wish to go with “Web Services” even for applications which are hosted in the same data center, developed by the same team, and written in the same technology, just because it is the “in thing”.

To use web services one needs to make the following conversions:
1. On the client, before making the request, convert the data from its native format into strings to form the XML to be sent.
2. On the server, on receiving the XML request, convert it back into the native format for processing.
3. On the server, after computing the results, convert them from the native format back into strings so that they can be sent as XML.
4. On the client, after receiving the response, convert the strings in the XML back into the native format for consumption.

Steps 1 and 4 may not be mandatory in some scenarios; steps 2 and 3 can almost never be avoided. These steps have to be executed irrespective of the protocol that is being used. These conversions are expensive and should be avoided when the applications are written in the same technology, and more so when they are hosted in the same data center (read: LAN).
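To make the cost concrete, here is a minimal sketch of steps 1 and 2 using JAXB; the OrderRequest class and its fields are hypothetical, and steps 3 and 4 repeat the same marshal/unmarshal cost in the opposite direction for the response.

    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.annotation.XmlRootElement;
    import java.io.StringReader;
    import java.io.StringWriter;

    public class ConversionCost {

        @XmlRootElement
        public static class OrderRequest {
            public int orderId;
            public double amount;
        }

        public static void main(String[] args) throws Exception {
            JAXBContext ctx = JAXBContext.newInstance(OrderRequest.class);

            OrderRequest req = new OrderRequest();
            req.orderId = 42;
            req.amount = 199.99;

            // Step 1 (client): native object -> XML string.
            StringWriter xml = new StringWriter();
            ctx.createMarshaller().marshal(req, xml);

            // Step 2 (server): XML string -> native object.
            OrderRequest received = (OrderRequest) ctx.createUnmarshaller()
                    .unmarshal(new StringReader(xml.toString()));

            // Steps 3 and 4 repeat this cost for the response.
            System.out.println("Round-tripped order " + received.orderId);
        }
    }

Every request/response pair pays this serialization cost on top of the network hop, even when both ends are Java processes on the same LAN.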

In an extreme case, one of the projects had the following data flow:
1. User from the Desktop Browser makes a request
2. The Application Server converts this into a Web service request and posts it to another server.
3. This Web Service server makes an EJB call to another server.
4. The EJB in turn invokes one or more stored procedures which actually have the business logic, data access and everything else.
5. The data is obviously returned through the same path.
There is no technology other than Java involved in any of the layers above and they are all hosted in the same data center.

Now the same functionality on mobile has been implemented as follows:
1. User from the Mobile Application makes a request
2. This is sent to the server as a REST request.
3. This REST request from mobile is converted and another internal REST request is made.
4. This internal REST request is converted into a Web service request and posted to another server.
5. This Web Service Server makes an EJB call to another server.
6. The EJB in turn invokes one or more stored procedures which actually have the business logic, data access and everything else.
7. The data is obviously returned through the same path.

Is it any surprise that this application is facing performance, stability, availability and reliability issues?


SQL Performance – 101

1. Fetch only the required columns. Do not use select *. Do not fetch columns from the database only to ignore them in the code.
a. This minimizes the network traffic between the database server and the server on which the data is processed.
b. It reduces memory utilization on the database server and on the server where the results are processed.
2. Fetch only the required rows. Use a proper “where” clause to limit the number of rows retrieved. Do not filter out records in the code.
a. This reduces the disk I/O on the database server.
b. It minimizes the network traffic between the database server and the server on which the data is processed.
c. It reduces memory utilization on the database server and on the server where the results are processed.
d. Use the “LIMIT”/“TOP” clause as available in the database to limit the number of rows retrieved. This comes in very handy when paginating data on the screen, and when exporting large amounts of data too (see the first sketch after this list).
3. Join only the minimum required tables. As a thumb rule, do not join more than three tables.
a. A join of two tables is, in the worst case, a Cartesian product of the two. This means that if we join two tables of 10,000 records each, at some stage the database may be dealing with 10,000 x 10,000 = 100,000,000 (100 million) records, and if we join three such tables the database may have to handle 1,000,000,000,000 (1 trillion) records. To avoid overloading the database server one should join the minimum number of tables.
b. Note that the limit of three has nothing to do with a limit of 1 trillion records. Modern databases use hash joins, which are much more efficient than Cartesian joins, and one is better off if the database uses them. But one cannot simply tell the database to use hash joins instead of Cartesian products; one needs to use the explain plan to determine which joins the database uses and plan accordingly.
c. Do not hesitate too much to denormalize the tables if it is found that one always ends up joining too many tables.
d. Leverage the memory of the process which processes the records to cache master data and list-of-values data, so that joins with the tables containing such data can be avoided. E.g. a typical table in most applications is a StatusMaster, which holds the list of statuses and their descriptions. To show data to the user on the screen one may end up joining with this table all the time. This can be avoided by keeping a hash table of the statuses in the memory of the process: when records are fetched, only the status id is read from the table, and the description to be shown to the user is looked up from the hash table (see the caching sketch after this list).
4. Do not use distinct unless absolutely necessary.
a. When the database finds a distinct clause it has to fetch the required rows, sort them on the columns marked distinct, and then remove the duplicates. If we are sure the values will be distinct, do not use the keyword. If possible, change the application, and if necessary the database, to ensure that the distinct keyword can be avoided. This is important for tables which hold large amounts of data.
5. Do not use order by unless absolutely necessary.
a. If an order by clause is specified, the database needs to fetch the records and then sort them. This is extra overhead for the database; do not sort unless absolutely necessary.
6. Do not use column functions on a column in the where clause.
a. If a column function is used on a column in the where clause, any index on the column cannot be leveraged, and this makes the query slower. E.g. where toupper(name) = ?. Programmatically ensure that a column function in the where clause is not necessary. If necessary, add an additional column on which the search can be executed (see the sketch after this list).
7. Similarly, do not convert data types from one to another, especially in where clauses. If the column is an integer then compare it with an integer value; do not compare it with a string value, even though the database may have the ability to do so. Data type casts/conversions are expensive.
8. Do not use “in” with a sub-query that will return a large number of values. Instead use an “exists” clause if possible.
a. If the in clause has a large list of values, the database finds it difficult to compare the value from each row against the values in the list, which slows down the query drastically. Use an “exists” clause or a join to fetch the data in such scenarios (see the sketch after this list).
9. Do not use a query like select … from tableb b1 where columna = (select max(columna) from tableb b2 where b1.columnx = b2.columnx).
a. This kind of query is typically required in tables where we need to pick up the latest record from a history of records. Instead, adopt one of the following strategies (see the last sketch after this list):
i. Maintain the history in a separate table, so that the main table has only the latest information and the correlated sub-query is not required.
ii. Maintain a flag in the table which indicates the latest record. When a new version of the record is entered, update the flag on the earlier record to indicate that it is no longer the latest, and set the flag on the new record to indicate that it is. Include the flag column in the appropriate index.
10. Query with integers and numbers wherever possible. Avoid querying by strings; it is easier to compare two numbers than two strings.
11. Avoid usage of “like”. Especially avoid a wildcard on both sides of a string column, e.g. name like “%smith%”; a leading wildcard prevents the use of any index on the column.
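The sketches below are minimal JDBC illustrations of the points above; all table and column names (orders, customers, StatusMaster and so on) are hypothetical. First, fetching only the required columns and rows, with a LIMIT for pagination (LIMIT is MySQL/PostgreSQL syntax; SQL Server uses TOP):

    import java.sql.*;

    public class FetchOnlyWhatYouNeed {
        // Hypothetical table "orders" with columns id, status, amount, created_on.
        public static void printOpenOrders(Connection con, int pageSize) throws SQLException {
            // Named columns, a filtering WHERE clause and a LIMIT,
            // instead of "SELECT * FROM orders" filtered in the code.
            String sql = "SELECT id, amount FROM orders " +
                         "WHERE status = ? ORDER BY created_on DESC LIMIT ?";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, "OPEN");
                ps.setInt(2, pageSize);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong("id") + " -> " + rs.getBigDecimal("amount"));
                    }
                }
            }
        }
    }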
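Next, point 3d: loading a small master table once into an in-process hash table, so that queries need not join with it every time:

    import java.sql.*;
    import java.util.HashMap;
    import java.util.Map;

    public class StatusCache {
        private final Map<Integer, String> descriptionById = new HashMap<>();

        // Load the small master table once at startup instead of
        // joining with StatusMaster in every query.
        public StatusCache(Connection con) throws SQLException {
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT status_id, description FROM StatusMaster")) {
                while (rs.next()) {
                    descriptionById.put(rs.getInt("status_id"), rs.getString("description"));
                }
            }
        }

        // Look up the description for a status id fetched along with a transaction row.
        public String describe(int statusId) {
            return descriptionById.getOrDefault(statusId, "UNKNOWN");
        }
    }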
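For points 6 and 7, a sketch that searches on a pre-computed column instead of wrapping the indexed column in a function, and binds the parameter with the column's own type:

    import java.sql.*;

    public class SearchWithoutColumnFunctions {
        public static void findCustomers(Connection con, String name) throws SQLException {
            // Bad: UPPER(name) defeats any index on "name":
            //   SELECT id FROM customers WHERE UPPER(name) = ?
            // Better: keep a pre-uppercased, indexed column (here name_upper, hypothetical)
            // and normalize the input in the application instead.
            String sql = "SELECT id FROM customers WHERE name_upper = ?";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, name.toUpperCase()); // convert in the app, not in the WHERE clause
                // Likewise, bind an int to an INTEGER column with ps.setInt, not a string.
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong("id"));
                    }
                }
            }
        }
    }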
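For point 8, the same filter expressed with “in” and with “exists”; which performs better depends on the optimizer, so verify with the explain plan:

    public class ExistsVersusIn {
        // Hypothetical schema: orders(id, customer_id), customers(id, region).
        // IN materializes the whole sub-query result and compares each row against it:
        static final String WITH_IN =
            "SELECT o.id FROM orders o WHERE o.customer_id IN " +
            "(SELECT c.id FROM customers c WHERE c.region = ?)";

        // EXISTS lets the database probe for a single match per row,
        // typically via an index on customers(id):
        static final String WITH_EXISTS =
            "SELECT o.id FROM orders o WHERE EXISTS " +
            "(SELECT 1 FROM customers c WHERE c.id = o.customer_id AND c.region = ?)";
    }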
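And for point 9 (strategy ii), maintaining an is_latest flag instead of running a correlated max() sub-query; the rates table is hypothetical:

    import java.sql.*;

    public class LatestRecordFlag {
        // The is_latest flag is included in the appropriate index, so reads
        // become "WHERE entity_id = ? AND is_latest = 'Y'" with no sub-query.
        public static void insertNewVersion(Connection con, long entityId, double value)
                throws SQLException {
            con.setAutoCommit(false);
            try (PreparedStatement clear = con.prepareStatement(
                     "UPDATE rates SET is_latest = 'N' WHERE entity_id = ? AND is_latest = 'Y'");
                 PreparedStatement insert = con.prepareStatement(
                     "INSERT INTO rates (entity_id, value, is_latest) VALUES (?, ?, 'Y')")) {
                clear.setLong(1, entityId);
                clear.executeUpdate();      // demote the previous latest record
                insert.setLong(1, entityId);
                insert.setDouble(2, value);
                insert.executeUpdate();     // the new record becomes the latest
                con.commit();
            } catch (SQLException e) {
                con.rollback();
                throw e;
            }
        }
    }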

Some Design Considerations/Optimizations

12. To generate values for a primary key column, or for a unique column where the temporal sequence of the numbers does not matter and one can tolerate gaps, use the auto-increment feature available in most databases. In the case of Oracle, use “sequences” to generate the values for these columns (see the sketch after this list).
a. In Oracle, create the sequence with the keywords “cache” and “noorder”. Along with “cache”, specify an integer which is the number of sequence ids to be cached. This is important for tables where this number is generated very frequently.
13. Create indexes judiciously. If a table has too many indexes, inserts and updates into the table become slower, as the database now needs to update the indexes too.
14. When creating an index, ensure that the cardinality of the values in the indexed column is not too small. The “cardinality” of a column indicates the uniqueness of the values in it: the more unique the data in the column, the higher its cardinality. Creating an index on a column which has a very low cardinality does not help, as the database will not be able to use the column effectively to fetch rows. E.g. if a column contains only “Y”/“N” values, or holds a small list of status codes, an index on it will not help query optimization.
15. Use the clustered index option only if required. An ideal clustered index reflects the temporal order in which records are inserted into the table. If this is not the case, inserts and updates can become slow as the data in the table grows.
16. Use the smallest “standard” datatype for each column. This is important because an unnecessarily large datatype uses up unnecessary memory on the database server every time the records are processed. E.g. try to use INTEGER or SMALLINT for all columns which hold integers. Do not use a BIGINT where it is not needed, and do not use a BYTE as it may not be supported on all databases.
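A hedged sketch of point 12 for Oracle; the sequence name order_seq is a placeholder:

    import java.sql.*;

    public class SequenceSetup {
        // One-time DDL (Oracle syntax): cache 100 ids and do not guarantee order,
        // trading gaps for fewer trips to the data dictionary.
        public static void createOrderSequence(Connection con) throws SQLException {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE SEQUENCE order_seq CACHE 100 NOORDER");
            }
        }

        // Fetch the next id for an insert.
        public static long nextOrderId(Connection con) throws SQLException {
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT order_seq.NEXTVAL FROM dual")) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }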

Summary

To get good performance from the database we need to follow these norms:
1. Keep the design simple.
2. Keep the queries simple, and ensure that the database does not have to do anything that is not necessary.
3. Fetch only the data that is required.
4. Normalize, but do not forget to denormalize where required.

Introduction

The topic will seem a sacrilege to quite a few members of the architect and designer community. Stored Procedures have been considered a panacea for all performance problems in the RDBMS world. Of late they have also been considered the panacea for preventing SQL injection in applications. Given this background it will not be surprising to see many architects and designers react with shock and awe on reading the heading of this topic.

Advantages/Benefits of Stored Procedures

First let us explore some of the advantages of Stored Procedures.
1. Usage of Stored Procedures prevents SQL injection, unless one uses dynamic queries within the Procedures themselves.
2. The data is processed within the database server and only the data that is really required to be sent to the client travels over the network. This optimizes the network traffic between the Database Server and Client and makes the performance better.
3. It is easier to leverage features like cursors and temporary tables to optimise the processing of the data.
4. All the applications that can connect to the database server can execute the business processes by executing the stored procedures in the database. This leads to the belief that stored procedures ease reuse of code across applications.

The Past

Given these advantages it is not surprising that many consider stored procedures a panacea for many of the problems they face. These benefits are (read: were) genuine and were a boon in days past.
Till a decade ago the scalability requirements of applications were limited. The bulk of the processing happened in batch processes, and the number of users of the online transaction processing systems was limited. Applications typically followed the client-server paradigm, where the client managed the user interactions and the database managed the business processes.
There was a lot going in favour of stored procedures in this scenario. A big server could manage the load generated by the online transaction processing.

The Present

The scale of operations in today’s applications has drastically changed. Applications today are not exposed to a limited number of users like the applications of a bygone era. Today’s applications are exposed on the internet and are accessed via browsers, tablets and mobile phones. Consider a banking application: in the past, access to the application was limited to the employees of the bank. Today the same banking application is exposed on the internet to the customers of the bank, along with a variety of its functionality. Customers can enquire about their balance, make a payment to a vendor via net banking, transfer funds to another person, and use a variety of other features. This has tremendously increased the load on the application.
Before we proceed it is important to remember the fact that stored procedures execute within the database server and can leverage resources in that server.
Now, if a majority of the business logic of an application is in stored procedures, every user of the application will be consuming resources on the database server. So as the number of users, and correspondingly the number of transactions, goes up, the load on the database server goes up. Soon we reach a stage where the database server hits its capacity, and the only way left to scale the application is to increase the size of the server on which the database is running. Databases scale only VERTICALLY and not HORIZONTALLY (yes, I hear you, we will come to Oracle RAC in a short while).
If, instead of putting all or a majority of the logic in the database, one were to put it in the “Application Servers” and leverage features like distributed caching, it would be possible to scale the application horizontally. The database would be used primarily for storing and retrieving data. To scale the application one would simply increase the number of “Application Servers” on which the application runs. In this way one can scale almost indefinitely.
Retrieving the data from the database server to the application server adds some inefficiency to the processing, but this is offset by the fact that the application can now scale to support a much larger load.
The cost of ownership also comes down, as one can use ordinary servers as “Application Servers”; one does not require expensive specialized servers for this purpose.
To get the efficiency of a stored procedure one can use a PreparedStatement in the Java world or a prepared command in the .Net world. Used the right way, these also shield the application from SQL injection, as in the sketch below.
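A minimal sketch, assuming a hypothetical users table: the parameter is bound, never concatenated, so the prepared plan is reused and the input cannot change the shape of the SQL.

    import java.sql.*;

    public class SafeLookup {
        // Parameter binding keeps user input as data, never as SQL text,
        // giving stored-procedure-like plan reuse plus SQL injection safety.
        public static boolean userExists(Connection con, String username) throws SQLException {
            String sql = "SELECT 1 FROM users WHERE username = ?";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, username); // bound, not concatenated into the SQL string
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next();
                }
            }
        }
        // Never do this: "SELECT 1 FROM users WHERE username = '" + username + "'"
    }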

Other perils of stored procedures

1. If one makes heavy usage of stored procedures then one is stuck with that particular database for the life of the application. Migrating to another database becomes an expensive proposition.
2. This becomes an even bigger limitation in the case of products. Customers wish to have the ability to run the product on a database of their choice, rather than having the vendor of the application decide the underlying database. If stored procedures are used in the product, one ends up maintaining a set of stored procedures per database, which is an expensive proposition.
3. The code in stored procedures is not conducive to efficient documentation even if one uses good coding practices, and we end up with bad code smells which are not easy to wish (or whiff) off.
4. The code in stored procedures is not conducive to refactoring, which means that the code will continue to smell.
5. The newer techniques of using interfaces and/or web services to expose features of an application to other applications are a much better way of reusing business logic than calling the same stored procedures from multiple applications. These mechanisms provide a more elegant way to loosely couple applications.

Note on Oracle RAC

Oracle RAC has the capability of scaling horizontally, but it comes with its own set of complexities and limitations. Databases leverage the memory on the server to cache data for better performance. This is possible, and very efficient, when the database is running on a single server, as all changes to the data pass through the same process, and it is possible to update the cache with the changed data or invalidate the right section of the cache.
When one starts running the database on more than one physical server, any change to the data must be conveyed to the caches on all the servers, or the corresponding cache entries must be invalidated on all the servers, or before using cached data each server must validate that the data has not been changed by a process on another server. This is not easy, and it adds processing overhead, reducing the benefit one gets from the caching mechanisms of traditional databases.

A note on NoSQL

Today’s applications have reached a scale where even the most powerful RDBMS is unable to scale to the levels of traffic experienced by these applications. Even if it scales, the cost of the hardware required to support these loads is prohibitive.
Google, Facebook, Twitter and many sites which deal with large data (read: Big Data) all use RDBMSs only to a limited extent. They instead use a new breed of database servers classified as NoSQL. The term NoSQL itself has no expansion; it was used as a Twitter hashtag to denote the set of databases which do not use SQL to provide access to the data they store. They also do not store the data in the traditional way, organized into tables and columns. NoSQL will be a topic for another post sometime in the future.
This further strengthens the reasons to use stored procedures only in really exceptional situations.

Conclusion

There are situations where stored procedures are still a possible solution. But given the scale of operations of today’s applications, it is advisable to consider stored procedures only in really, really exceptional situations. It is time we learn better ways of developing applications that perform well and are secure, rather than sticking to the older practice of using stored procedures to achieve performance.
At the CMG meet at TCS, Mumbai, on 4th October, Mr. B. Gopal used an apt phrase, “Conservation of Minimum Complexity”, to mean that any system has a minimum level of complexity and it is not possible to reduce this complexity any further. The aim should be to distribute this complexity across the various layers of the system. When one makes excessive usage of stored procedures one increases the complexity of the database layer, and this is not good in the long run, both from a performance and from a maintenance perspective.


Today’s web applications have to cater to a wider audience, serve richer pages, and perform better than they did in the early days.
One of the highly recommended techniques for improving performance is to cache as much data as possible.
And it is possible to allocate large amounts of memory to processes running on current machines (read: 64-bit operating systems on 64-bit CPUs). But the one roadblock that holds engineers back from giving a very large heap to Java is Garbage Collection.
Java adopted Garbage Collection to reduce the pain of memory management for developers, and it does a fairly good job. But it fails to deliver when it has to deal with large memory cleanups. Every JVM that exists today pauses the application at one time or another to clean up the garbage that has been generated, and during these pauses the responses to user requests slow down drastically. This is an unacceptable situation for most applications.
At a high level there are two ways the heap is treated:
1. The whole heap is treated as one, and memory is allocated and freed as required.
2. The heap is split into multiple regions based on the age of the objects. Objects are allocated in the young region and are promoted to the tenured (old generation) region as they live longer. This is also called a generational heap, and the three major JVMs today prefer this mechanism, with some variations, to manage the heap. Incidentally, the .Net CLR also uses a similar mechanism.

There are different types of collectors:
1. Serial – the one with the longest pause times.
2. Parallel – reduces pause times based on the number of CPUs available.
3. Concurrent Mark and Sweep – further reduces pauses, as it does part of its work concurrently with the application threads and starts working before memory is exhausted.
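The collector is selected with JVM flags. A hedged example of launching with a large heap and the concurrent collector on the HotSpot JVM of this era (the heap sizes and application jar are placeholders):

    java -Xms8g -Xmx8g -XX:NewRatio=2 -XX:+UseConcMarkSweepGC -verbose:gc -jar app.jar

-XX:+UseSerialGC and -XX:+UseParallelGC select the other two collectors, and -verbose:gc logs each pause so its cost can be measured.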

Most JVMs and most collection mechanisms do well collecting the young generation regions. The problem starts when they have to collect the tenured generation, as objects are typically inextricably linked by the time they reach this generation. This generation is also the biggest in size and contains more objects than the other generations, which means it takes longer to trace the roots of the objects in it.

Consider the scenario in which the application caches data. The cached objects would typically be allocated in the young generation and slowly progress to the tenured generation. If the right objects are cached, these objects will not change often. At the same time, other objects which are not of this nature will also arrive in the tenured generation. Now when a tenured generation collection is triggered, the JVM not only has to walk through the objects that will potentially die, but also through the cached objects, which are very unlikely to die. It is possible that the objects that will not die occupy more space and are greater in number than those likely to die, which means that most of the work done in the Garbage Collection cycle is redundant.

If, instead, the JVM provided a separate memory section for caching, the tenured generation would contain only those objects that have stayed long for some right or wrong reason, and in a well-written program these will be very few. Collecting the tenured area should then be faster, as the cached objects, which by our assumption are the biggest contributors to the tenured area, would be excluded from the Garbage Collection cycle.

An added advantage would be that objects identified for caching would go directly to the cache region and never trouble the young generation or tenured generation collections. To handle references to the cache objects held in the other areas, we could maintain a reference counting scheme which decrements the reference counts during Garbage Collection cycles in those areas. This would ensure that one does not have to walk the young and tenured regions when running a Garbage Collection cycle in the cache region. It would be better still not to allow references to these objects to be held by objects allocated in the other regions, other than within a method.

The memory management of this area can be further enhanced by dividing it into the following subareas, with the ability to tune the size of each:
1. An area for objects that never die.
2. An area for objects that can be removed using an LRU algorithm.
3. An area for objects that can be removed using an LFU algorithm.
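Until JVMs offer such regions, subarea 2 can be approximated at the application level. A minimal sketch of an LRU region using java.util.LinkedHashMap (the capacity is a placeholder):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class LruRegion<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        public LruRegion(int maxEntries) {
            super(16, 0.75f, true); // accessOrder = true yields least-recently-used ordering
            this.maxEntries = maxEntries;
        }

        // Evict the least recently used entry once the region exceeds its size.
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;
        }
    }

    // Usage: Map<Integer, String> cache = new LruRegion<>(100_000);

Note that this only bounds which objects are retained; unlike the proposed cache region, the entries still live in the ordinary heap and are still walked by the tenured generation collector.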

With this it would be possible for application developers to allocate and manage a huge cache within a single JVM, which is difficult today given the limitations of the Garbage Collection cycle.

