Sundarrajk's Weblog

Archive for the ‘Software’ Category

See http://architects.dzone.com/articles/every-programmer-should-know for a comparison of latencies when accessing different parts of the system. This should give an idea of what strategy should be adopted to get better performance.

But even after looking at the latencies people wish to go with “Web Services” even for applications which are hosted in the same data center and have been developed by the same team and in the same technology, just because it is the “in thing”.

To use web services one needs to make the following conversions:
1. On the client, before making the request, convert the native format to strings to form the XML to be sent.
2. On the server, on receiving the XML request, convert it to the native format for processing.
3. On the server, after getting the results, convert the native format back to strings so that they can be sent as XML.
4. On the client, after receiving the response, convert the strings in the XML to the native format for consumption.

Steps 1 and 4 may not be mandatory in some scenarios. Steps 2 and 3 can hardly ever be avoided. These steps need to be executed irrespective of the protocol that is being used. These conversions are expensive and should be avoided when one is writing applications in the same technology, and more so when the applications are hosted in the same data center (read LAN).

In an extreme, one of the projects had the following data flow:
1. User from the Desktop Browser makes a request
2. The Application Server converts this into a Web service request and posts it to another server.
3. This Web Service server makes an EJB call to another server.
4. The EJB in turn invokes one or more stored procedures which actually have the business logic, data access and everything else.
5. The data is obviously returned through the same path.
There is no technology other than Java involved in any of the layers above and they are all hosted in the same data center.

Now the same functionality over a mobile has been implemented as follows:
1. User from the Mobile Application makes a request
2. This is sent to the server as a REST request.
3. This REST request from mobile is converted and another internal REST request is made.
4. This internal REST request is converted into a Web service request and posted to another server.
5. This Web Service Server makes an EJB call to another server.
6. The EJB in turn invokes one or more stored procedures which actually have the business logic, data access and everything else.
7. The data is obviously returned through the same path.

Is it any surprise that this application is facing performance, stability, availability and reliability issues?


SQL Performance – 101

1. Fetch only the required columns. Do not use select *. Do not fetch columns from the database only to ignore them in the code.
a. This minimizes the network traffic between the database server and the server on which the data is being processed.
b. It reduces the memory utilization on the database server and on the server where the results are being processed.
2. Fetch only the required rows. Use a proper "where" clause to limit the number of rows retrieved. Do not filter out records in the code.
a. This reduces the disk I/O on the database server.
b. This minimizes the network traffic between the database server and the server on which the data is being processed.
c. This reduces the memory utilization on the database server and on the server where the results are being processed.
d. Use the "LIMIT"/"TOP" clause as available in the database to limit the number of rows retrieved. This comes in very handy for paginating data on the screen, and also when exporting large amounts of data.
3. Join only the minimum required tables. As a rule of thumb do not join more than three tables.
a. A join of two tables is conceptually a Cartesian product of the two tables. This means that if we join two tables which have 10,000 records each, at some stage the database could be dealing with 100,000,000, i.e. 100 million records, and if we join three such tables we could be making the database handle 1,000,000,000,000, i.e. 1 trillion records. To avoid overloading the database server one should try and join the minimum number of tables.
b. Note that the limit of three has nothing to do with a limit of 1 trillion records. Modern databases use techniques such as hash joins, which are much more efficient than a raw Cartesian product, and one is better off if the database uses these instead. But one cannot simply tell the database to use hash joins instead of Cartesian products. One needs to use the explain plan to determine the joins that the database actually uses and plan accordingly.
c. Do not hesitate too much to denormalize the tables if it is found that one is always ending up joining too many tables.
d. Leverage the memory of the process which is processing the records to cache master data and list-of-values types of data, so that one can avoid joins with the tables that contain such data. E.g. a typical table in most applications is a StatusMaster, which has a list of statuses and the descriptions for these statuses. To show the data to the user on the screen one may end up joining with this table all the time. This can be avoided if one keeps a hash table of the statuses in the memory of the process. When the records are fetched only the status id is read from the table, and one looks up the hash table to get the description to be shown to the user.
4. Do not use Distinct unless absolutely necessary.
a. When the database finds a distinct clause it has to fetch the required rows, sort them on the columns that have been marked distinct and then remove the duplicates. If we are sure that the values will be distinct then do not use the keyword distinct. If possible try and change the application, and if necessary the database, to ensure that the distinct keyword can be avoided. This is important for tables which have large amounts of data.
5. Do not use Order By unless absolutely necessary.
a. If an order by clause is specified then the database needs to fetch the records and then sort them. This is an extra overhead for the database. Do not sort unless absolutely necessary.
6. Do not use column functions with a column in the where clause.
a. If a column function is used along with a column in the where clause then any index on the column cannot be leveraged, and this will make the query slower. E.g. where toupper(name) = ?. Programmatically ensure that it is not necessary to have a column function in the where clause. If necessary add an additional column on which the search can be executed.
7. Similarly, do not convert data types from one to another, especially in where clauses. If the column is an integer then compare it with an integer value; do not compare it with a string value, even though the database may have the ability to do it. Data type castings/conversions are expensive.
8. Do not use "in" with a sub-query that will return a large number of values in the "in" clause. Instead use an "exists" clause if possible.
a. If the in clause has a large list of values then the database finds it difficult to compare the value from the row to the values in the list. This slows down the query drastically. Instead use a join to fetch the data in such scenarios.
9. Do not use a query like select … from tableb b1 where columna = (select max(columna) from tableb b2 where b1.columnx = b2.columnx).
a. This kind of query is typically required in tables where we need to pick up the latest record from a history of records. Instead, adopt one of the following strategies:
i. Maintain the history in a separate table so that the main table has only the latest information and the correlated subquery is not required.
ii. Maintain a flag in the table which indicates which is the latest record. When a new version of the record is entered, update the flag in the earlier record to indicate that it is no longer the latest, and set the flag in the new record to indicate that it is. Include the flag column in the appropriate index.
10. Query with integers and numbers wherever possible. Avoid querying by strings. It is easier to compare two numbers than to compare two strings.
11. Avoid usage of "like". Especially avoid usage of a wildcard on both sides of a string column, e.g. name like "%smith%". A leading wildcard prevents the database from using an index on the column.

Some Design Considerations/Optimizations

12. To generate sequences for a primary key column or for a unique column, where the temporal order of these numbers does not matter and one can tolerate gaps, use the auto-increment feature that is available in most databases. In the case of Oracle use "sequences" to generate the values for these columns.
a. In Oracle create the sequence with the keywords "cache" and "noorder". Along with "cache" specify an integer which is the number of sequence ids that should be cached. This is important for tables where this number needs to be generated very frequently.
13. Create indexes judiciously. If a table has too many indexes then inserts and updates into the table become slower, as the database now needs to update the indexes too.
14. When creating an index ensure that the cardinality of the values in the indexed column is not too small. "Cardinality" of a column indicates the uniqueness of the values in the column: the more unique the data in the column, the higher the "cardinality" of the column. Creating an index on a column which has a very low cardinality does not help, as the database will not be able to use this column effectively to fetch the rows. E.g. if there is a column which contains only a "Y"/"N" value or a small list of status codes, creating an index on such a column will not help query optimization.
15. Use the clustered index option only if required. An ideal clustered index should reflect the temporal order in which the records are inserted into the table. If this is not the case then inserts and updates can become slow as the data in the table increases.
16. Use the smallest "standard" datatype for the columns. This is important because if one uses an unnecessarily large datatype then one ends up using unnecessary memory on the database server every time the records are processed. E.g. try and use INTEGER or SMALLINT for all columns which hold integers. Do not use a BIGINT because it is too big, and do not use a BYTE as it may not be supported on all databases.

Summary

To get good performance from the database we need to follow these norms:
1. Keep the design simple
2. Keep the queries simple and ensure that the database does not have to do anything that is not necessary
3. Fetch only the data that is required
4. Normalize, but do not forget to denormalize where required
In the following code

package com.tcs.ngps.htmlparsers

import org.jsoup.Jsoup
import org.jsoup.nodes.Document

class JSoupParser {

    public JSoupParser() {

    }

    static main(args) {
        // Read the HTML file passed in as the first argument
        String lString = (new File(args[0])).getText()
        Document lDocument = Jsoup.parse(lString)
        lDocument.outputSettings().prettyPrint(true)
        // Writing via newWriter() was found to truncate large outputs (see the note below)
        //(new File("D:/Temp/AWRJSoupParsed.html")).newWriter().write(lDocument.toString())
        def lBufferedWriter = new BufferedWriter(new FileWriter(new File("D:/Temp/AWRJSoupParsed.html")))
        lBufferedWriter.write(lDocument.toString())
        lBufferedWriter.close()
    }
}

The above code reads an HTML file and converts it to an XML file using the JSoup parser. It then writes the XML to a file.
Now if the code
(new File("D:/Temp/AWRJSoupParsed.html")).newWriter().write(lDocument.toString())
is used to write the output, the output file is always smaller than the total length of the string (when the string is large). Only when one uses the following code
def lBufferedWriter = new BufferedWriter(new FileWriter(new File("D:/Temp/AWRJSoupParsed.html")))
lBufferedWriter.write(lDocument.toString())
lBufferedWriter.close()

is the whole file written. Unable to figure out why. (A likely cause is that the writer returned by newWriter() is a BufferedWriter that is never flushed or closed, so whatever is still sitting in its buffer is lost; the second version works because close() flushes the buffer.)

Beautiful Data: The Stories Behind Elegant Data Solutions by Toby Segaran
My rating: 3 of 5 stars

Beautiful Data Edited by Toby Segaran and Jeff Hammerbacher

Found the Chapters 7 (Data Finds Data), 14 (Natural Language Corpus Data) and 20 (Connecting Data) to be very interesting. The others will be interesting based on one’s areas of interest.

Chapter 1 – Seeing your life in Data
The first chapter is a description of two projects by Yahoo. The first one captures the area through which the user is traveling. The aim is to show the user how much pollution the user has been subjected to. The challenge was to show the exposure levels along with the movements and stops of the user. A variety of techniques were used before finally arriving at a way of overlaying the exposure as a coloured line and showing the locations where the user stayed for a long time as circles. The colour of the line/circle indicated the exposure levels.
The second project involved analyzing the activity tweets of the users and giving them a perspective on their behaviour. The challenge in this was to make the users tweet their activities. This was achieved by asking the user to set a goal and then tweet, so that the application could analyze the activities of the user and provide hints on which activities were positive and which were negative.

Chapter 2 – The Beautiful People: Keeping users in mind when designing data collection methods
The second chapter is about a project that was carried out to survey old people on the usage of a product that was to be launched. The challenges were: how to survey a population that is not comfortable with computers; how to survey a population whose vision may be deteriorating due to old age; how to induce them to respond to a survey; and how to identify the geography from which they are responding.
To induce the responders to go through the survey, the survey was kept as small as possible. The questions were displayed such that the users did not get a feeling of having to answer too much or too little. The fonts were selected so that the questions stood out. The questions were worded to avoid all ambiguity. The geographic region was determined by using the IP addresses from which the users responded. In effect the project managed to garner a response far beyond the normal response to any survey.

Chapter 3 – Embedded Image Data Processing on Mars
The third chapter presents the challenges of Image Processing on board the Mars rover. The processing power was a challenge as one worked with toughened CPUs and not the normal CPUs (CPU clock speed was only 20 MHz). The RAM was also limited to a few MBs. The only storage was some flash memory. The VxWorks Real Time Operating System was used as the OS. The code was written in C. The main functionality was to store the images captured by the on board cameras (each image was 1 Mega Byte in size), process them for any errors, and downlink it to earth when the connectivity was available.
The final solution involved building a queued system where the image captured by the camera was stored in a section of the flash drive. The image processor picked up this image and cleaned it up. The downlink module then picked this up and sent it to the earth and after successful downlink marked this space as available for the next image. All this happened without the position of the image being shifted in the flash drive.

Chapter 4 – Cloud Storage Design in a PNUTSHELL
The fourth chapter describes the design of PNUTS, a distributed database at Yahoo used for a variety of purposes. The key requirement was to have geographic replication and distribution of the data with minimum latency. Some of the principles that were followed were:
1. Each record had a mastership, which would be the place where it gets updated. The mastership changes only if it is observed that the updates are coming from another geographic location consistently for a long period of time.
2. The tables were not modeled in the traditional manner but were logically grouped together based on the expected access.
3. The Replication order was fixed from one geography to another and was not random. This ensured that the replication traveled minimum distance.
4. The system provided the application the ability to choose between availability and consistency on a per table basis. So if the main replica goes down the application can choose to make the table available and choose to sacrifice consistency or it can stall the functionality and choose to sacrifice availability till the main partition is up.
5. Because of the data structure being used certain complex queries were not possible.

Chapter 5 – Information Platforms and the Rise of the Data Scientist
The fifth chapter talks of the rise of the Data Scientist. These are people expected to be able to take a vast amount of data, process it quickly and make sense out of this data. The chapter talks about how the data about access patterns being collected at Facebook was initially stored in MySQL, how it moved to Oracle and finally had to be moved to Hadoop as the data being generated grew in size.

Chapter 6 – The Geographic Beauty of Photographic Archive
The sixth chapter talks of an exercise carried out in England where the users were asked to take photographs of the locality that they move around in and geotag them and provide tags to denote the topography of the terrain. These images were then used to create tree maps which provided insights into the geography of England as well as the differences in the semantics used by the people in the different regions.

Chapter 7 – Data finds Data
The seventh chapter speaks about a very interesting concept of data finding data. Some of the examples quoted are:
1. A guest calls the reception at 8:00 AM asking them to give a wake up call at 12:00 PM. But soon the maid knocks on the door for housekeeping. If "data could find data" then the maid's schedule would have been altered so that the guest is not disturbed.
2. A user searches for a soon to be released book which is not yet available on Amazon. Now if Amazon can keep track of this search and can notify the user when the shipment of books arrives then the user could buy it if interested. If this notification did not go through then the user may come back after a month only to find that the book is sold out.
3. A parent checks a website for the safety of a particular street as her child walks to school through that path. There is nothing to report on the street. Now if the website can keep track of this inquiry and, if an incident occurs, notify the parent of the incident, the parent would be alerted. Otherwise the parent may never notice till the next time she comes and checks the site.
4. Government departments do not generally share data amongst each other given the security, privacy requirements. This prevents identification of patterns. E.g. the same person being involved in terrorism and narcotics could show up if these departments use “Data finds Data” principle and look up each other’s databases, possibly in a restricted, secure fashion.
A federated search across the various data stores is one way to achieve this, but this can be a tedious and long-drawn process considering the amount of data that one needs to parse. The way to achieve this would be to extract and classify the data and then act on it. A system that needs to implement this feature needs to have the following blocks:
1. The existence of, and availability of, observations
2. The ability to extract and classify features from the observations
3. The ability to efficiently discover related historical context
4. The ability to make assertions (same or related) about new observations
5. The ability to recognize when new observations reverse earlier assertions.
6. The ability to accumulate and persist this asserted context
7. The ability to recognize the formation of relevance/insight
8. The ability to notify the appropriate entity of such insight.

Chapter 8 – Portable Data in Real Time
The eighth chapter is about how Gnip uses an event-driven mechanism, rather than a polling mechanism, to gather social data, "clean" and "normalize" it and provide it to subscribers. The key takeaway from this is how bad polling for data is, compared to data being pushed to the subscribers.

Chapter 9 – Surfacing the Deep Web
The ninth chapter is about how search engines only scan the static contents of websites and fail to scan "deep". E.g. scanning a second-hand car website for all models available. Typically this will involve the user providing some inputs, like the state in which she is looking for second-hand cars, the models that she is looking for and the price range she is looking for. After this the user will need to click a button to get the list of matching cars. A normal search engine bot will not do any of this. All it will indicate to somebody searching for second-hand cars is that here is a site that sells second-hand cars. If instead it were able to dig "deep" into the site by simulating a real user and index these pages too, then it would be possible for the search engine to provide more details. But this is not easy to achieve, as one needs to think about what parameter values to search with. The permutations and combinations of parameter values will be too many, and most may not make sense. The chapter talks about a way by which they first probe the site with generic values like "*" or blank for the parameters, parse the output and then try to gauge the actual parameter values that can be passed to get meaningful data. The idea is not to get every single piece of data stored behind the scenes, but to try and cover most types of the data that is hidden and index them so that the search engine is enriched.

Chapter 10 – Building Radiohead’s House of Cards
The tenth chapter is about how Radiohead's music video "House of Cards" was made without any camera, using only data captured by two main pieces of equipment which measured the amount of light reflected back by the environment in which they were operated. The whole data is captured as a set of XY coordinates and the intensity of each point. This data is available at http://code.google.com/radiohead for people to play around with and create their own videos and effects.

Chapter 11 – Visualizing Urban Data
The eleventh chapter is about how a website was built for the residents of Oakland, California by scraping data from the Police department and displaying it on a map, making it easier for the residents to track crime in the streets of Oakland. Similar to the first chapter, this involves displaying the crimes on a map, and the challenges are similar to what was seen in the first chapter. The site was briefly shut down when the police department figured out what these people were doing and stopped access to their websites. The scraping stopped when the police department finally relented and started giving them the data at the end of each day in an Excel sheet.

Chapter 12 – The Design of Sense.us
The twelfth chapter is a project that took the US census data for the 100+ years and built visualizations around it. Some interesting concepts were:
1. Usage of stacked line charts to indicate different percentages, e.g. the percentage of population engaged in a particular field of work, or the percentage of population by place of birth. These show the trends over a period of time and can be further drilled down into as another set of stacked line charts.
2. A population pyramid, a chart where the left side represented the statistics of the male population and the right side those of the female population. The Y axis was the age group, and a year slider allowed the user to visualize the change in the population by age group over a period of time.
3. Doubly linked discussion. The project provided the ability to the users to annotate the views. These were linked to the current view that user was seeing. The users could navigate from the comments to the views and views to the comments. Hence doubly linked.
4. A provision was given to annotate the graphs graphically. So users could overlay the view with lines, circles etc. and could share these with others.
5. There was also a feature by which the views could be collected by the user for later viewing or for sharing with others. This involved the user just indicating that she wished to “Add View” to her graphical list of bookmarks.
This project was not released to the outside world. Instead the technology was used by IBM to build the site many-eyes.com, which can be used by users to upload their data and create their own visualizations of it.

Chapter 13 – What Data Doesn’t Do
This chapter talks about scenarios which cannot be detected using data. Some things which seem trivial to the eye will not be easy to figure out using data. E.g. the presence of an ugly duckling amongst a set of swans in an image can be immediately spotted by the eye, but detecting the same with the data that represents the image is not easy.
Similarly, reading the line "Iamnotgoingtocomehometoday" is easy for the eye, but not easy for the computer.
Similarly there are scenarios where the computer will be able to figure out trends more easily than the human eye can. E.g. from a complex scatter plot it will be difficult for the eye to detect a trend, whereas using mathematical techniques a computer would be able to find a suitable trend.
Sometimes one can tend to be misled by a biased "narrative". This is called "narrative fallacy". E.g. if one is shown a set of graphs and is told that these represent the stock prices of three companies in the manufacturing industry, and if one is asked to pick the stock that will perform well in the coming days, one would tend to guess based on the data provided. Humans tend to build a story around the data to support their conclusions.
The tendency to apply a past conclusion to the present analysis is called “confirmation bias”.
The author states that data does not necessarily drive one in the right direction because
1. Our tools for using data are inexact.
2. We process data with known biases.
The author also makes a set of statements:
1. More data isn’t always better: The argument is that this applies well for data that has a normal distribution, but not all data show normal distribution and so it does not necessarily apply in all scenarios.
2. More data isn't always easy: Capturing, storing and processing large amounts of data is not easily done even given the advancements that we have seen in processors.
3. Data alone doesn't explain: The author argues that "given two variables correlated in a statistically significant way, causality can work forward, backward, in both directions or not at all". The author cites an example of how an article in the Wall Street Journal had suggested that since "premarital cohabitation is correlated with a higher rate of divorce, unwed couples could avoid living together in order to improve their chances of staying together after marriage". A very skewed conclusion at best.
4. Data isn't good for a single answer: Analysis of data does not lead to a single conclusion most of the time. It usually points to possible conclusions.
5. Data doesn't predict: In a controlled environment it is possible to predict an outcome with near certainty, but in domains with less certainty, such as human or physical behaviour, modeling is an important tool to help explain patterns, and in the eagerness one can tend to overfit a model.
6. Probability isn't intuitive: The author cites an example of how probability is not always intuitive. The author states "when using data to answer a question, we don't know what evidence to exclude and how to weigh what we include".
7. Probabilities aren't intuitive: The author states that when we are dealing with multiple probabilities it becomes even more tricky, and one tends to get biased by prior experience.
8. The real world doesn't create random variables: Sometimes one can get carried away by statistics, forgetting that statistics are not laws of nature. This can lead to very wrong conclusions. In the real world there is a lot of interconnection, and the data observed is not a set of random independent values.
9. Data doesn't stand alone: It is not easy to make a decision based only on data. E.g. when one has to take a decision on whether to give a loan to a person or not, it is not only the financial credentials that influence the decision; factors like the social background of the applicant and the approver also influence the outcome.
10. Data isn’t free from the eye of the beholder: The same data viewed and analyzed by different persons can lead to different conclusions because of their personal cognitive biases.

Chapter 14 – Natural Language Corpus Data
This chapter talks about how the corpus of natural language words (tokens) that has been accumulated by Google over the years can be effectively used to improve natural language text interpretation. One example that the author illustrates is "Word Segmentation". This involves interpreting a phrase like choosespain.com. Does this mean "choose spain" or "chooses pain"? Based on the frequency of these two terms appearing together in the collection of phrases at Google, it is possible to say with reasonable certainty that the phrase is "choose spain", as the number of occurrences of the phrase "choose spain" is 3120 whereas there are no occurrences of the phrase "chooses pain" in the collection of phrases.
But if one considers a phrase like "insufficientnumbers" it becomes difficult to determine whether this means "in sufficient numbers" or "insufficient numbers". A human may be able to make it out based on the context, but for a computer this is difficult, especially when one knows that the number of occurrences of "in sufficient numbers" is 32378 and the number of occurrences of "insufficient numbers" is 20751.
The second topic that the author talks about is how to use this data to decipher secret codes and the third topic that is discussed is correction of spellings. The author mentions that the other interesting applications would be
1. Language Identification
2. Spam Detection and Other Classification Tasks
3. Author Identification
4. Document Unshredding and DNA Sequencing
5. Machine Translation

Chapter 15 – Life in Data: The story of DNA
In this chapter the author describes how the sequencing of DNA, with its billions and billions of base pairs, was achieved.

Chapter 16 – Beautifying Data in the Real World
This chapter starts off with the premise that there are two fundamental problems with collecting “beautiful data”
1. The universe is inherently noisy, and so one will tend to get different readings of the same experiment given slightly varying circumstances.
2. The second problem is of space limitations. Raw data based on which conclusions are drawn tend to be too large. The question that arises is how does one present the raw data so that somebody looking at the conclusions drawn can validate the relevance of the conclusion?

The author goes on to describe the different techniques that were used to minimize these problems in the collection of data with respect to the chemical behaviour of the different chemicals.

Chapter 17 – Superficial Data Analysis: Exploring Millions of Social Stereotypes
This talks about the site facestat.com. This site allows users to post their photos and ask the world to comment on their appearance. When sufficient data was collected it was used to analyze stereotypes. One of the standout observations was that people tended to rate women as more beautiful than men. The majority of children were considered to be cute or beautiful as opposed to any other age group.

Chapter 18 – Bay Area Blues: The effect of housing crisis
This chapter talks about how the data on sales of houses in California over the period 2000 to 2010 was analyzed to try and figure out the impact of the housing loan crisis on the price of houses. The data was analyzed from various perspectives and provided different insights.

Chapter 19 – Beautiful Political Data
This chapter explains how election data from different elections in the US was taken and analyzed to yield some very interesting insights into how the people tended to vote.

Chapter 20 – Connecting Data
This chapter talks about how similar/same data obtained from different sources can be linked to find the right matches. The chapter talks about how data stored in graph form can be identified as being the same by trying to approach the data from different directions. E.g. to identify that two movies are the same, one can try to reach the movie through the actors, through the director and through the year of release, and if all of these paths, or a majority of them, lead to the same two nodes from the different sources, then it is very likely that the two nodes represent the same movie.


Ranbaxy, the onetime epitome of Indian entrepreneurship, has been found to have committed fraud to get licenses to manufacture generic drugs. Read http://features.blogs.fortune.cnn.com/2013/05/15/ranbaxy-fraud-lipitor/.

Is it surprising that this has happened? No, not at all. Corruption, cooking up figures, cooking up documentation, and special coaching and preparation purely from the perspective of clearing an examination are all forms of corrupt practices that are prevalent in India.

A similar situation exists in the software industry in India. What we produce is Working Software, not Quality Software. Most software projects that we execute go through a crisis phase when some hero has to come and rescue the project. This happens after the client escalates the matter to the higher management in the organization; the higher management cracks the whip and then the right people start coming in and fixing the problem.

The reasons for this problem plaguing the software industry are manifold. It starts with the response to the Request For Proposal (RFP) from the client. Typically the RFP is a very high level requirement from the client. In addition to the few lines of requirements, the client also typically mentions the timeline by which they wish to have the project ready for deployment. In the eagerness to win the project the software contractors squeeze the estimates as much as possible and make all sorts of impossible assumptions, both about the requirements and about the ability of the team to develop and deliver the project.

Finally it comes down to price negotiation. The client presses the already squeezed contractor for further juice. The sales person from the contractor, sitting in front of the client, agrees to the cuts, both in the cost of the project and in the timelines within which it has to be delivered, while at the same time agreeing to increased requirements. The sales person does this as she has to meet her sales target, and then she is absolved of all other responsibilities.

Then the project comes to the delivery team, whose target is to deliver the project in the minimum time with the maximum profit, and this team starts cutting down on costs by getting the cheapest resources available. These are capable resources, but with no experience. This team starts gathering requirements, during which time the actual enormity of the task strikes them, and they try to negotiate to bring down the scope. In most cases it is a worthless exercise. At the end of the exercise the requirements are still unclear and many loose ends are left dangling, as the team that gathered the requirements was a novice one and the team that specified the requirements would not have known what details needed to be specified for this team to design and develop the system.

Now when the design and coding start in earnest, two problems begin to show: first, there will be very few team members who are really capable of putting together a good design; second, there will be even fewer team members who are capable of writing good code. Added to this mixture is the fact that the requirements were not clear, which means that this team starts making more assumptions. The tendency to go back to the client to get clarifications is rare under these circumstances, as the leader would say "If you go to them you will get more requirements, so assume and proceed so that we deliver on time".

Soon the time for delivery will pass and the client will start raising hackles about why the project is getting delayed. One goes through a series of negotiations on how the requirements were unclear and how the client team did not support the development, while the client keeps saying how incompetent the development team is. In many cases it goes up to the higher management, which sends in a Process Team to assess the processes being followed by the team. In most cases it turns out that the documentation is not in place or is not complete, and the process team points to this gap as the reason for the delay; the project team now needs to fill in the gaps by creating these documents, delaying the project further. The next attempt will be to get more people into the project to try and "speed up" the delivery. Soon, without many noticing, the profits that were planned would have started eroding.

Finally at some point in the development cycle, when the pressure mounts beyond tolerance, the development team will dump a whole lot of code on the Quality Assurance (QA) team. This team would be equally at a loss about the functionality and will go about testing the application, raising errors like a mismatched font or a grammatically incorrect message and a few other such trivial errors. Some of these would be fixed and some would be rejected by the team with the explanation that it is a feature.

Once it goes past the QA team the application would be delivered to the client for User Acceptance Testing (UAT). The client team would typically be expected to carry out this exercise in addition to carrying on their routine jobs. As a result this testing is given stepmotherly treatment, and after reporting a few errors this team goes back to doing their routine jobs. They would most likely raise one or two "errors" which would be disputed as "change requests" by the contractor, and negotiations would follow to come to a compromise.

The development team will try and fix these problems and make another release, going through the same cycle of testing by QA. The QA team will typically test the application only for the errors that were raised in UAT, and as a result, when it goes to the user group for the next round of UAT, other errors which have crept into the system due to these bug fixes will typically show up.

This cycle will repeat a few times before the management gets tired of the process, and charges of incompetence will be traded between the contractor and the client, with the client being more accusatory than the contractor due to the nature of the relationship between them.

Now the management from the contractor will try and send in more senior designers and developers who will come and work "miracles" and get the system functioning much better. This would still not fix all the problems and issues, but in all likelihood the project will end up being deployed in production.

Within a few days of it being in production one will start seeing functional issues, performance issues and availability issues with the system. The development team, ill-equipped and overburdened, will try and resolve the issues and will end up resolving some and bringing in more issues in the process.

This cycle will continue for a few iterations before the client reaches her wits' end and escalates the matter to the senior management. Now another set of senior designers and architects will come in and start looking into the system. This team will in all likelihood have a supercilious attitude towards the developer community and will go about berating the entire development team. They will try and fix up a few things and get things working. After a few iterations the system will be functional. It will still not be efficient or error free, nor perform as well as it could have performed.

If one analyzes the above situation, the situation at Ranbaxy, or other incidents such as constructions collapsing because of bad workmanship or poor materials being used, or cases like the "housing loan schemes" which led to a global meltdown of the economy, the one conclusion that can be drawn without much hesitation is that these happen due to greed for money. It is not that many of these people do not have money, it is just that one can NEVER HAVE ENOUGH OF MONEY.

1. Thou shalt not copy-paste code
2. Thou shalt name appropriately even if it means typing long names.
3. Thou shalt write and automate Unit Test Cases
4. Thou shalt write small methods and small classes. (#)
5. Thou shalt document, not how it works, not what it does($), but how it can be used and why it exists
6. Thou shalt have only two levels of indentation in any method.
7. Thou shalt not write god classes (*).
8. Thou shalt update your, logically complete, code changes at least once a day to the version control system.
9. Thou shalt update your development environment at least once a day from the version control system.
10. Thou shalt be humble enough to accept errors or inefficiencies in your code as pointed out by your reviewers, peers or juniors and be open to correct them.
Notes
(#) Because all good things come in small packages.
($) What it does or contains should be known from the name.
(*) God classes are classes that do too many things, so that the whole application ends up depending on a few classes.

1. Thou shalt configure static code checking, run it, and fix violations that are raised by the tool.
2. Thou shalt version your code in the Version Control System.
3. Thou shalt adopt Continuous Integration. (*)
4. Thou shalt let the end user use the system regularly during development to get her feedback.
5. Thou shalt make an effort and understand the business process of the domain.
6. Thou shalt discuss with client the requirements in detail and if required ask, what may be perceived to be stupid/obvious questions, rather than assuming.
7. Thou shalt document all assumptions and share and discuss with the client the assumptions.
8. Thou shalt not drive the development team to work more than 10 hours or make it work on Saturdays and holidays.
9. Thou shalt not add more developers to try and finish the project earlier. (#)
10. Thou shalt not skip testing and performance testing of the system before any release.
Note
(*) Check out code from the Version Control, Compile the code, Run static Code checking, run the Automated Unit Tests and generate reports.
(#) Read The Mythical Man-Month by Fred Brooks.

1. Thou shalt design Interfaces and design to Interfaces.
2. Thou shalt use the easiest framework/architecture in design, and not be enamored into using a complex/latest framework without justification.
3. Thou shalt use composition over inheritance. (*)
4. Thou shalt keep the design simple.
5. Thou shalt design for security from day one, without over engineering.
6. Thou shalt design for performance from day one, without over engineering.
7. Thou shalt design for usability from day one, without over engineering.
8. Thou shalt discuss with client and try to convince her to water down the requirement before accepting requirements which demand complex engineering.
9. Thou shalt design application to be independent of platform, container, database and any third party components as far as possible.
10. Thou shalt document all architectural options considered and rationale for selection of the architecture.
Notes
(*) It is better to earn your stripes rather than to inherit your stripes.

JavaScript – The Good Parts – Part 7

This is the last and final post based on the book JavaScript – The Good Parts by Douglas Crockford. In this post, as in the previous post, we will continue with some more methods in the pre-defined objects of JavaScript.

RegExp

regexp.exec(string) The exec is the most powerful (and the slowest) of all methods that use regular expression. If it successfully matches the regexp and the string, it returns an array. The “0” element of the array will contain the substring that matched the regexp. The 1 element is the text captured by the group 1, the 2 element is the text captured by group 2, and so on. If the match fails it will return a null.
If the regexp has a “g” flag, things are a little complicated. The searching begins not at the position 0, but at the position regexp.lastIndex (which is initially zero). If the match is successful then the regexp.lastIndex will be set to the position of the first character after the match. An unsuccessful match resets the regexp.lastIndex to 0.
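For illustration (this example is not from the book and the variable names are mine), here is exec being called repeatedly with a "g" flag, showing how lastIndex moves:
var numbers = /\d+/g;
var text = 'room 4, floor 12';
var first = numbers.exec(text); //first[0] is '4', numbers.lastIndex is now 6
var second = numbers.exec(text); //second[0] is '12', numbers.lastIndex is now 16
var third = numbers.exec(text); //third is null and numbers.lastIndex is reset to 0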

regexp.test(string) This is the simplest and the fastest of the methods that use regular expressions. If the regexp matches the string it returns true, otherwise it returns false. Do not use the "g" flag with this method.
var b = /&.+;/.test('frank &amp; beans'); //b is true

String

string.charAt(pos) This returns the character at the position "pos" in this string. If pos is less than zero or greater than or equal to the length of the string, it returns an empty string.
var name = 'Curly';
var initial = name.charAt(0); //initial is 'C'

string.charCodeAt(pos) This returns an integer representation of the character at position "pos" instead of the character. If pos is less than zero or greater than or equal to the length of the string, it returns NaN.
var name = 'Curly';
var initial = name.charCodeAt(0); //initial is 67

string.concat(string …) This method makes a new string by concatenating the other strings together.
var s = 'C'.concat('a', 't'); //s is 'Cat'

string.indexOf(searchString, position) This searches for the “searchString” from “position” in the string. This returns a -1 if this string is not found, else it returns the index at which the searchString is found.
var text = 'Mississippi';
var p = text.indexOf('ss'); //p is 2
p = text.indexOf('ss', 3); //p is 5
p = text.indexOf('ss', 6); //p is -1

string.lastIndexOf(searchString, position) This searches for the “searchString” from “position” in the string but in the reverse direction. This returns a -1 if this string is not found, else it returns the index at which the searchString is found.
var text = 'Mississippi';
var p = text.lastIndexOf('ss'); //p is 5
p = text.lastIndexOf('ss', 3); //p is 2
p = text.lastIndexOf('ss', 6); //p is 5

string.localeCompare(that) This compares two strings and returns a negative number if the "this" string is smaller than the second, 0 if they are equal and a positive number if the "this" string is greater than the second.
var m = ['AAA', 'A', 'aa', 'a', 'Aa', 'aaa'];
m.sort(function (a, b) {
  return a.localeCompare(b);
});

//m in some locale after sorting is ['a', 'A', 'aa', 'Aa', 'aaa', 'AAA'];

string.match(regexp) This matches a string and a regular expression. How it does so depends on the "g" flag. If no "g" flag is specified then the result of calling string.match(regexp) is the same as calling regexp.exec(string). However if the regexp has the "g" flag, then it produces an array of all the matches but excludes the capture groups.
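For illustration (not an example from the book), the difference the "g" flag makes:
var text = 'Mississippi';
var a = text.match(/ss(i)/); //a is ['ssi', 'i'], the same result as /ss(i)/.exec(text)
var b = text.match(/ss(i)/g); //b is ['ssi', 'ssi'], all the matches but no capture groups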

string.replace(searchValue, replaceValue) The replace method searches the string for occurrence of “searchValue” and replaces them with “replaceValue”. The “searchValue” can be a regular expression object. If it is a string then only the first occurrence of searchValue is replaced.
If the searchValue is a regular expression and it has the “g” flag then all the occurrences will be replaced. If it does not have a “g” flag then only the first occurrence will be replaced.
If the replaceValue is a string then the character $ has a special meaning.
//Capture 3 digits within parentheses
var oldareacode = /\((\d{3})\)/g;
var p = '(555)666-1212'.replace(oldareacode, '$1-'); //p is '555-666-1212'

Dollar sequence and its replacement:
$$ : $
$& : the matched text
$number : the text captured by that group
$` : the text preceding the match
$' : the text following the match

If the replaceValue is a function, it will be called for each match, and the string returned by the function will be used as the replacement text. The first parameter passed to the function is the matched text. The second parameter is the text of the capture group 1, the next parameter is of group 2 and so on.
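For illustration (not an example from the book; the names are mine), a replaceValue function that upper-cases whatever the first capture group matched:
var result = 'cat and dog'.replace(/(c)(at)/g, function (match, group1, group2) {
  return group1.toUpperCase() + group2; //group1 is 'c', group2 is 'at'
});
//result is 'Cat and dog'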

string.search(regexp) This is like the indexOf method except that it takes a regexp instead of a string. The "g" flag, if passed, is ignored.
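For illustration (not an example from the book; the text reuses the slice example below):
var text = 'and in it he says "Any damn fool could';
var pos = text.search(/["']/); //pos is 18, the position of the first quote character
var neg = text.search(/\d/); //neg is -1, there are no digits in the string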

string.slice(start, end) This is like the array slice. It returns the string value from the start to the end -1 position. If the start parameter is negative, it adds string.length to it.  If no end parameter is specified then it is taken to be string.length. If the end parameter is negative it adds the string.length to it.
var text = 'and in it he says "Any damn fool could';
var a = text.slice(18); //a is '"Any damn fool could'
var b = text.slice(0, 3); //b is 'and'
var c = text.slice(-5); //c is 'could'
var d = text.slice(19, 32); //d is 'Any damn fool'
var e = text.slice(19, -5); //e is 'Any damn fool '

string.split(separator, limit) The split method creates an array of strings by splitting this string into pieces. The optional limit parameter can limit the number of pieces that will be split. The separator parameter can be a string or a regular expression.
If the separator is the empty string, an array of single characters is produced.
var digits = '0123456789';
var a = digits.split('', 5); //a is ['0', '1', '2', '3', '4']
Otherwise the string is searched for all occurrences of the separator. Each unit of text between separators is copied into the array. The "g" flag is ignored.
var ip = "192.168.32.45";
var b = ip.split('.'); //b is ['192', '168', '32', '45'];
var c = '|a|b|c|'.split('|'); //c is ['', 'a', 'b', 'c', ''];

string.substring(start, end) This is same as the slice method except it does not adjust for negative values.

string.toLocaleLowerCase() Converts the string to a lower case based on the locale.

string.toLocaleUpperCase() Converts the string to an upper case based on the locale.

string.toLowerCase() Same as toLocaleLowerCase() but without considering the locale.

string.toUpperCase() Same as toLocaleUpperCase() but without considering the locale.

String.fromCharCode(char …) This converts a series of character codes (numbers) into a string. Note that it is invoked on String itself and not on a string instance.
var a = String.fromCharCode(67, 97, 116); //a is 'Cat'

Conclusion JavaScript is a very powerful language and it has its good parts. It also has its dark side, like the global variables, but if one takes the trouble to learn and understand the language then one can work wonders with this language.

To end the series here is a series of blogposts by Venkat Subramanian which will serve as a very good introduction to functional programming for the uninitiated.
http://blog.agiledeveloper.com/2012/11/functional-style-of-programming-in_14.html
http://blog.agiledeveloper.com/2012/11/functional-style-of-programming-in_15.html
http://blog.agiledeveloper.com/2012/11/functional-style-of-programming-in_16.html
http://blog.agiledeveloper.com/2012/11/functional-style-of-programming-in.html

This is the penultimate posting in this series of posts based on the book JavaScript – The Good Parts by Douglas Crockford.
The previous post covered how to “curry” functions and some JavaScript gotchas.
This post gives a list of pre-built methods in the standard objects pre-defined in JavaScript. Many of these would be known only to hardcore users of JavaScript, and this should serve as an eye-opener to the casual users of JavaScript.

Array
array.concat(items)
This method produces a new string array containing the elements of the base array followed by the items; the base array itself is not modified. If an item is itself an array, the elements of that array are copied into the result.
E.g.
var a = ['a', 'b', 'c'];
var b = ['d', 'e', 'f'];
var c = a.concat(b, true);
//c is ['a', 'b', 'c', 'd', 'e', 'f', true];

array.join(separator)
This joins the elements in an array separated by the separator and gives a String.
var a = ['a', 'b', 'c'];
var c = a.join('/');  //c is 'a/b/c'

array.pop()
This removes the last element from the array and returns it as the value. The size of the array reduces by one.
var a = ['a', 'b', 'c'];
var c = a.pop();
//c is 'c' and a is ['a', 'b'];

array.push(item …)
This appends the items to the end of the array and returns the array's new length.
var a = ['a', 'b', 'c'];
var b = ['d', 'e', 'f'];
var c = a.push(b, true);
//a is ['a', 'b', 'c', ['d', 'e', 'f'], true]; c is 5
Note how this is different from concat. The array "b" has been embedded at index 3 of the array "a"; its contents have not been copied into "a". Also note that push returns the new length and not the resulting array.

array.reverse()
This reverses the values in the array.
var a = ['a', 'b', 'c'];
var b = a.reverse();
//both a and b will be ['c', 'b', 'a']

array.shift()
The shift method removes the first element from an array and returns it. If the array is empty it returns undefined.
var a = ['a', 'b', 'c'];
var b = a.shift();
//b is 'a' and a is ['b', 'c']

array.slice(start, end)
This is like substring: start is the start position and end is one past the last position to be copied, i.e. elements up to end - 1 are copied. If end is not specified then the copy runs to the end of the array.
var a = ['a', 'b', 'c'];
var b = a.slice(0, 1); //b will be ['a'];
var c = a.slice(1); //c will be ['b', 'c']
var d = a.slice(1, 2); //d will be ['b']

array.sort(comparefn)
This sorts the array assuming that all the elements are strings. If one needs to sort based on some other datatype one can pass a compare function. This function is passed two values at a time and should return a negative number if the first is less than the second, 0 if they are equal and a positive number if the first is greater than the second, to sort the elements in ascending order.
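For illustration (not an example from the book), a compare function that sorts numbers numerically rather than as strings:
var n = [15, 4, 8, 42, 23, 16];
n.sort(); //n is [15, 16, 23, 4, 42, 8] because the values are compared as strings
n.sort(function (a, b) {
  return a - b;
}); //n is [4, 8, 15, 16, 23, 42]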

array.splice(start, deleteCount, item, …)
This method deletes "deleteCount" elements starting from the "start" position and inserts the values in item into the gap. It returns an array containing the deleted elements.
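For illustration (the values here are mine):
var a = ['a', 'b', 'c'];
var r = a.splice(1, 1, 'ache', 'bug');
//a is ['a', 'ache', 'bug', 'c'] and r is ['b'], the array of deleted elements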

array.unshift(item …)
The unshift method is like the push method, except that it pushes the items onto the front of the array instead of at the end. It returns the array’s new length.
var a = ['a', 'b', 'c'];
var r = a.unshift('?', '@');
//a is ['?', '@', 'a', 'b', 'c'];
//r is 5

Function
function.apply(thisArg, arguments)
This executes the function with "thisArg" treated as "this" inside the function, and with the "arguments" array passed as the arguments to the function.
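For illustration (the function and variable names here are mine):
var add = function (a, b) {
  return a + b;
};
var sum = add.apply(null, [3, 4]); //sum is 7

var statusObject = {status: 'A-OK'};
var getStatus = function () {
  return this.status;
};
var s = getStatus.apply(statusObject); //s is 'A-OK' because "this" is statusObject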

Number
number.toExponential(fractionDigits)
This converts this number to a string in the exponential form. The optional fractionalDigits controls the number of decimal places. This should be between 0 and 20.
Math.PI.toExponential(0) = 3e+0
Math.PI.toExponential(2) = 3.14e+0
Math.PI.toExponential(7) = 3.1415927e+0
Math.PI.toExponential(16) = 3.1415926535897930e+0
Math.PI.toExponential() = 3.141592653589793e+0

number.toFixed(fractionDigits)
This method converts this number to a string in the decimal form. The optional fractionDigits parameter controls the number of decimal places. It should be between 0 and 20. The default is 0.
Math.PI.toFixed(0) = 3
Math.PI.toFixed(2) = 3.14
Math.PI.toFixed(7) = 3.1415927
Math.PI.toFixed(16) = 3.1415926535897930
Math.PI.toFixed() = 3

number.toPrecision(precision) This converts the number to a string in decimal form. The optional precision parameter controls the number of digits of precision. It should be between 1 and 21.
Math.PI.toPrecision(2) = 3.1
Math.PI.toPrecision(7) = 3.141593
Math.PI.toPrecision(16) = 3.141592653589793
Math.PI.toPrecision() = 3.141592653589793

number.toString(radix) This converts the number to string. The optional radix parameter controls the radix or the base. This should be between 2 and 36. The default radix base is 10.
Math.PI.toString(2) = 11.001001000011111101101010100010001000010110100011
Math.PI.toString(7) = 3.1103755242102643
Math.PI.toString(16) = 3.243f6a8885a3
Math.PI.toString() = 3.141592653589793

Object
object.hasOwnProperty(name)
This indicates whether the property with the specified "name" belongs to the object itself or is derived from one of the prototype objects in its hierarchy.
var a = {member: true};
var b = Object.beget(a); //b uses a as its prototype (beget is the helper defined earlier in the book, similar to Object.create)
var t = a.hasOwnProperty('member'); //"t" is true
var f = b.hasOwnProperty('member'); //"f" is false
var v = b.member; //"v" is true

