Sundarrajk's Weblog

Archive for the ‘Software’ Category

Debug It!: Find, Repair, and Prevent Bugs in Your Code by Paul Butcher
My rating: 3 of 5 stars

In the book the author covers different aspects of

Effective Debugging

In the first chapter the author advises:
1. Work out why the software is behaving unexpectedly.
2. Fix the problem.
3. Avoid breaking anything else.
4. Maintain or improve the overall quality (readability, architecture, test coverage, performance, and so on) of the code.
5. Ensure that the same problem does not occur elsewhere and cannot occur again.

The author emphasizes that without first understanding the true root cause of the bug, we are outside the realms of software engineering and delving instead into voodoo programming or programming by coincidence.

He suggests that an empirical approach is the best way to debug, i.e. provide different inputs and observe how the system behaves.

The Core Debugging Process involves the following steps:
1. Reproduce: Find a way to reliably and conveniently reproduce the problem on demand.
2. Diagnose: Construct hypotheses, and test them by performing experiments until you are confident that you have identified the underlying cause of the bug.
3. Fix: Design and implement changes that fix the problem, avoid introducing regressions, and maintain or improve the overall quality of the software.
4. Reflect: Learn the lessons of the bug. Where did things go wrong? Are there any other examples of the same problem that will also need fixing? What can you do to ensure that the same problem doesn’t happen again?

Address one Bug at a time: Picking too many bugs to address at one time prevents focusing properly on any one of them.
Check Simple Things first: Somebody may have encountered something similar and may already have a solution.

Reproduce

1. Reproduction of the error should be consistent and efficient; otherwise testing the fix becomes a chore.
2. So reproduce the error in a controlled environment to achieve consistency.
3. To keep it efficient, try to reduce the input that needs to be provided and the processing that needs to be done; store the state at every step so that only the erroneous step needs to be rerun.
4. Automate the test conditions to make it quicker and easier to test the application after the fix. Replaying a log file can be a good strategy in scenarios where logging through a proxy was used to capture the error condition.

The following will help in reproducing the error:
1. Logging at appropriate places so that one knows what is happening in the system. Too much logging will be unacceptable in a production system.
2. Where possible, use a proxy to capture the network traffic and try to reproduce the error by replaying this traffic.
3. If calls to libraries are problematic or they need to be emulated in a test environment, write a shim (a proxy to a library), capture the inputs and outputs, and use these to reproduce the error. In engineering, a shim is a thin piece of material used to fill the space between objects. In computing we’ve borrowed the term to mean a small library that sits between a larger library and its client code. It can be used to convert one API to another or, as in the case we’re discussing here, to add a small amount of functionality without having to modify the main library itself. (A minimal sketch of such a shim appears after this list.)
4. Reach out to the user community that is able to reproduce the error and get inputs from them. Give them specially instrumented code to help pinpoint the error.
5. Read the documentation on the platform if the problem seems to be occurring beyond the realm of the code that has been written, and read the errors reported by others using the same platform.
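The following is a minimal sketch, in Java, of the kind of logging shim described in point 3 above. PricingService and its calculate() method are hypothetical names used purely for illustration; the idea is simply that the shim records the inputs and outputs of each call to the real library so that a failing call can later be replayed.

// A shim: a thin wrapper between client code and a library that logs inputs and outputs.
// PricingService and calculate() are hypothetical stand-ins for the real library interface.
import java.util.logging.Logger;

interface PricingService {
    double calculate(String productId, int quantity);
}

class LoggingPricingShim implements PricingService {
    private static final Logger LOG = Logger.getLogger("PricingShim");
    private final PricingService delegate;

    LoggingPricingShim(PricingService delegate) {
        this.delegate = delegate;
    }

    @Override
    public double calculate(String productId, int quantity) {
        LOG.info("calculate(" + productId + ", " + quantity + ")");  // record the input
        double result = delegate.calculate(productId, quantity);     // call the real library
        LOG.info("-> " + result);                                     // record the output for replay
        return result;
    }
}

Client code is pointed at the shim instead of the real implementation, so the capture requires no change to the library itself.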

Irreproducible Errors

Most bugs are reproducible. The few scenarios where a bug may be irreproducible or difficult to reproduce are usually caused by one of the following:
1. Starting from an unpredictable initial state: C and C++ programs are prone to this problem.
2. Interaction with external systems: this can happen if the other system is not running in lock-step with this software. If input from the external system arrives while the current system is in different states, the error can be difficult to reproduce.
3. Deliberate randomness: some systems, such as games, use deliberate randomness, and these can be difficult to debug. But if the same seed is used for the pseudo-random number generator, the bug becomes much easier to reproduce (see the sketch after this list).
4. Multithreading: this happens because of the pre-emptive multi-tasking provided by the operating system. Since threads can be stalled and restarted at different times depending on the activity on the CPUs at that moment, it becomes difficult to reproduce errors in such an environment. Try using sleeps to simulate the stalling of one thread and the execution of another in order to provoke the error.
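A minimal sketch, in Java, of point 3: fixing the seed of the pseudo-random number generator so that a “random” failure becomes repeatable. The seed value is arbitrary; in practice one would log the seed used in the failing run and replay with it.

import java.util.Random;

public class SeededRandomDemo {
    public static void main(String[] args) {
        long seed = 42L;                  // hypothetical seed; log the seed used in production runs
        Random random = new Random(seed);
        for (int i = 0; i < 5; i++) {
            // the same seed produces the same sequence on every run, making the behaviour reproducible
            System.out.println(random.nextInt(100));
        }
    }
}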

Good Practices of Reproducing

If a bug is taking a long time to reproduce and has still not been pinned down, it may be because another bug is masking it. Try concentrating on a different bug in the same area and clearing it before retrying the difficult one.

Diagnosis

How to Diagnose?

1. Examine what you know about the software’s behaviour, and construct a hypothesis about what might cause it.
2. Design an experiment that will allow you to test its truth (or otherwise).
3. If the experiment disproves your hypothesis, come up with a new one, and start again.
4. If it supports your hypothesis, keep coming up with experiments until you have either disproved it or reached a high enough level of certainty to consider it proven.

Techniques of Diagnosing

1. Instrument the code to understand the flow better.
2. Use a binary search pattern with logging to locate the source of the error: check for the error before and after the execution of a stretch of code; if the error has appeared, look in the first half of that stretch, otherwise look in the second half; then split the offending half into two again, and repeat until the exact point of the error is found.
3. Use a binary search pattern in version control to identify the version in which the error was introduced.
4. Use a binary search pattern on the data to identify which part of the input triggers the error (a sketch of this appears after this list).
5. Focus on the differences. The application works for most customers but not for specific ones: check how these customers differ from the rest for whom the application works. Similarly, if it works in most environments but not in a particular one, try to figure out what is different in that environment. If it happens only for specific input files, figure out what is different in those files compared to the files for which it works.
6. Use debuggers when available.
7. Use interactive consoles where debuggers are not available or are not good enough.
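A minimal sketch, in Java, of point 4: bisecting the input data to find the smallest prefix that still reproduces the failure. The failing() check and the record contents are hypothetical stand-ins for the real processing code; the sketch assumes the full input is known to fail.

import java.util.List;

public class InputBisector {

    // Stand-in for "run the real processing and report whether the bug shows up".
    static boolean failing(List<String> records) {
        return records.stream().anyMatch(r -> r.contains("BAD"));
    }

    // Narrows down to the first record whose inclusion triggers the failure.
    static int firstFailingIndex(List<String> records) {
        int low = 0, high = records.size();      // the smallest failing prefix length lies in (low, high]
        while (high - low > 1) {
            int mid = (low + high) / 2;
            if (failing(records.subList(0, mid))) {
                high = mid;                       // bug reproduces within the first mid records
            } else {
                low = mid;                        // bug needs more than mid records
            }
        }
        return high - 1;                          // index of the record that first triggers the failure
    }

    public static void main(String[] args) {
        List<String> input = List.of("ok-1", "ok-2", "BAD record", "ok-3");
        System.out.println("First failing record index: " + firstFailingIndex(input));
    }
}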

Good Practices of Diagnosing

1. When experimenting make only one change at a time.
2. Ignore nothing. Do not shrug off the unexpected as an anomaly. It could be that our assumptions are wrong.
3. Maintain a record of experiments and results so that it is easy to trace back.
4. Anything that you don’t understand is potentially a bug.
5. Learn from others. Search the net for similar problems and the solutions offered.
6. All other things being equal, the simplest explanation is the best. – Occam’s Razor
7. Writing automated test cases helps because it lets us concentrate only on the broken cases.
8. Keep Asking “Are you changing the right thing?” If the changes you’re making have no effect, you’re not changing what you think you are.
9. Validate and revalidate your assumptions
10. Ensure that the underlying system on which diagnosis is being done is static and not changing.
11. If one is stuck in debugging a problem, one good way is to ask somebody else to take a look at it.

Fixing

Best Practices

1. Make sure you know how you’re going to test it before designing your fix.
2. Do not let the fixes mess up the original clean design and structure of the code. Haphazardly put-together fixes can undo the good design principles followed in the original design. Any fix should leave the code in better shape than it was before.
3. Clean up any ad hoc code changes before making the final fix so that no unwanted code gets checked in. Keep only what is absolutely necessary.
4. Use existing test cases. Modify the test cases if required, or write the failing test case and run it against the code without the fix. Then fix the code and run the failing test case to see that it passes after the fix (a small test-first sketch appears after this list):
1. Run the existing tests, and demonstrate that they pass.
2. Add one or more new tests, or fix the existing tests, to demonstrate the bug (in other words, to fail).
3. Fix the bug.
4. Demonstrate that your fix works (the failing tests no longer fail).
5. Demonstrate that you haven’t introduced any regressions (none of the tests that previously passed now fail).
5. Fix the root cause, not the symptom. E.g. if one encounters a NullPointerException, the solution is not to catch the NullPointerException and handle it, or even worse suppress it; it is necessary to figure out why the NullPointerException is occurring and fix that cause. Giving in to the temptation of quick fixes is not the right thing; making the right fix is.
6. Refactor or change functionality or fix a problem — one or the other, never more than one.
7. Always check in small changes. Do not check in large changes as it will make it very difficult to find out which change actually caused the problem. Ensure check-in comments are as meaningful (and specific) as possible.
8. Diff and check what exactly is being checked in before actually checking in.
9. Get the code reviewed. This is very important, as errors that go unnoticed by the author are often caught by a reviewer.
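A minimal sketch, using JUnit 5, of the test-first fix workflow described in point 4 above. The Invoice class, its total() method, and the test are hypothetical; the point is only the sequence: add a test that demonstrates the bug, watch it fail, fix the code, then watch it pass along with the existing tests.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class InvoiceTotalTest {

    // Step 2 of the workflow: a new test that demonstrates the reported bug (it fails before the fix).
    @Test
    void totalIsZeroForAnEmptyInvoice() {
        Invoice invoice = new Invoice();
        assertEquals(0.0, invoice.total(), 0.0001);  // fails while the bug is present, passes after the fix
    }
}

// Minimal class under test, included only to keep the sketch self-contained.
class Invoice {
    double total() {
        return 0.0;  // the fixed behaviour; before the fix imagine this returned a wrong value or threw
    }
}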

After Fixing – Reflect

Sometimes “The six stages of debugging” reads as follows:
1. That can’t happen.
2. That doesn’t happen on my machine.
3. That shouldn’t happen.
4. Why is that happening?
5. Oh, I see.
6. How did that ever work?

After fixing one needs to reflect on the following points:
• How did it ever work?
• When and why did the problem slip through the cracks?
• How to ensure that the problem never happens again?

Find out the root cause. A useful trick when performing root cause analysis is to ask
“Why?” five times. For example:
• The software crashed. Why?
• The code didn’t handle network failure during data transmission. Why?
• There was no unit test to check for network failure. Why?
• The original developer wasn’t aware that he should create such a test. Why?
• None of our unit tests check for network failure. Why?
• We failed to take network failure into account in the original design.

After fixing do the following:
1. Take steps to ensure that it does not ever happen again. Educate yourself, educate others on the team.
2. Check if there are other similar errors.
3. Check if the documentation needs to be updated as a result of the fix.

Other aspects of handling and managing bugs

1. To better aid debugging collect relevant environment and configuration information automatically.
2. Detect bugs early, and do so from day one.
3. Poor quality is contagious (the “broken windows” concept; the theory was introduced in a 1982 article by social scientists James Q. Wilson and George L. Kelling). So do not leave bad code around; fix bad code at the earliest opportunity.
4. A Zero Bug Software is impossible, so take a pragmatic approach and try to reach as close to Zero bugs as possible. Temper perfectionism with pragmatism.
5. Keep the design simple. Not only does a simple design make your software easier to understand and less likely to contain bugs in the first place, it also makes it easier to control—which is particularly useful when trying to reproduce problems in concurrent software.
6. Automate your entire build process, from start to finish.
7. Version management of code is absolutely mandatory.
8. Different source should mean a different version number, even if the change to the code is minuscule.

Some Excerpts from the Book


A few weeks ago a blog post was published asking a rhetorical question, “Can software be created in factories?”. My good friend pointed me to the Wikipedia article on “Software Factory”. What I would like to point out is that the statement “Software factory refers to a structured collection of related software assets that aids in producing computer software applications or software components according to specific, externally defined end-user requirements through an assembly process. [1] A software factory applies manufacturing techniques and principles to software development to mimic the benefits of traditional manufacturing. Software factories are generally involved with outsourced software creation.” from Wikipedia is completely incorrect and fallacious, despite the number of individuals who believe it.

Martin Fowler’s bliki entry, Code as Documentation, has a link to Jack Reeves’s famous essay “What is Software Design?”.

This article first appeared in 1992 in the C++ Journal. It was written by Jack Reeves, who had been in the industry for more than 10 years at the time, and the trigger was the fact that C++ had taken the software world by storm. It was being seen as the panacea for all the problems plaguing the software industry at that time.

He summarizes the article as follows:

To summarize:

  • Real software runs on computers. It is a sequence of ones and zeros that is stored on some magnetic media. It is not a program listing in C++ (or any other programming language).
  • A program listing is a document that represents a software design. Compilers and linkers actually build software designs.
  • Real software is incredibly cheap to build, and getting cheaper all the time as computers get faster.
  • Real software is incredibly expensive to design. This is true because software is incredibly complex and because practically all the steps of a software project are part of the design process.
  • Programming is a design activity—a good software design process recognizes this and does not hesitate to code when coding makes sense.
  • Coding actually makes sense more often than believed. Often the process of rendering the design in code will reveal oversights and the need for additional design effort. The earlier this occurs, the better the design will be.
  • Since software is so cheap to build, formal engineering validation methods are not of much use in real world software development. It is easier and cheaper to just build the design and test it than to try to prove it.
  • Testing and debugging are design activities—they are the software equivalent of the design validation and refinement processes of other engineering disciplines. A good software design process recognizes this and does not try to short change the steps.
  • There are other design activities—call them top level design, module design, structural design, architectural design, or whatever. A good software design process recognizes this and deliberately includes the steps.
  • All design activities interact. A good software design process recognizes this and allows the design to change, sometimes radically, as various design steps reveal the need.
  • Many different software design notations are potentially useful—as auxiliary documentation and as tools to help facilitate the design process. They are not a software design.
  • Software development is still more a craft than an engineering discipline. This is primarily because of a lack of rigor in the critical processes of validating and improving a design.
  • Ultimately, real advances in software development depend upon advances in programming techniques, which in turn mean advances in programming languages. C++ is such an advance. It has exploded in popularity because it is a mainstream programming language that directly supports better software design.
  • C++ is a step in the right direction, but still more advances are needed.

The points to note with respect to the factory aspect of software are the ones about coding being design. The author states that coding is design, and one cannot dispute this fact. And since one does not design in a factory, software development cannot be considered to happen in a factory. It may look like splitting hairs, but for somebody who is coding, be it a novice who started yesterday or somebody who has been doing it for donkey’s years, it is apparent that this is indeed a fact. One keeps designing with practically every line of code.

Another interesting excerpt is “In software engineering, we desperately need good design at all levels. In particular, we need good top level design. The better the early design, the easier detailed design will be. Designers should use anything that helps. Structure charts, Booch diagrams, state tables, PDL, etc.—if it helps, then use it.”

This was the statement of the author in the essay published in 1992. Writing about this in 2005 the author says “Today, I would phrase it differently. I would say we need good architectures (top level design), good abstractions (class design), and good implementations (low level design). I would also say something about using UML diagrams or CRC cards to explore alternatives.”
This is what the author is referring to from the earlier article: “We must keep in mind, however, that these tools and notations are not a software design. Eventually, we have to create the real software design, and it will be in some programming language. Therefore, we should not be afraid to code our designs as we derive them.”

The author goes on to say “This is fundamental. I am not arguing that we should not “do design.” However you want to approach the process, I simply insist that you have not completed the process until you have written and tested the code.”

Note that the author bolsters the argument that software development involves design at all stages; it is not limited to a single design phase.

Another interesting statement in the second essay is “When the document is detailed enough, complete enough, and unambiguous enough that it can be interpreted mechanistically, whether by a computer or by an assembly line worker, then you have a design document. If it still requires creative human interpretation, then you don’t.” This again goes to show that software cannot be created in factories.
One final argument to support the claim that software cannot be created in a factory: “The problem with software is – design is not just important, it is basically everything. Saying that programmers should not have to design is like saying fish should not have to swim. When I am coding, I am designing. I am creating a software design out of the void.”

Update
When this was sent to a few people, the reply I got back was, “In the Indian IT industry, there is no such thing as a ‘Less Able Programmer’. All donkeys can be ‘processed’ to become a stallion. All crows can become swans…”, and all that can be said about this is: “This belief is exactly the bane of the Indian IT industry and, in my black, cynical, negative opinion, is going to lead to the downfall of what we today consider to be a cash cow.”

In my interactions with various personnel working in the IT world I have noticed that some people have a tendency to use the word “factory” to describe a location where people are either writing new applications or maintaining existing software. Something about the word “factory” irritates me. I do not get a comfortable feeling when somebody equates software development/maintenance to the tasks performed in a factory.

What, I think, these people fail to realize or admit is that in a factory the tasks tend to be repetitive and hence “teachable” and “learnable”. This is the reason why we see so much automation in factories; they hardly need any human intervention.

Unlike the manufacturing factories the software “factory” is full of people. Except for very few processes in the development cycle, software creation cannot be automated. Human intervention is required at almost every stage. Software requires human touch during creation.

From a maintenance and support perspective too software needs humans to address any issues that come up in the production. Very little of this can be automated.

Given all this, it gives me the creeps when somebody refers to a “software factory”.

See http://architects.dzone.com/articles/every-programmer-should-know for a comparison of latencies when accessing different parts of the system. This should give an idea of what strategy should be adopted to get better performance.

But even after looking at the latencies people wish to go with “Web Services” even for applications which are hosted in the same data center and have been developed by the same team and in the same technology, just because it is the “in thing”.

To use web services one needs to make the following conversions:
1. On the client, before making the request, convert the native format to strings to form the XML to be sent.
2. On the server, on receiving the XML request, convert it to the native format for processing.
3. On the server, after getting the results, convert the native result back to strings so that it can be sent as XML.
4. On the client, after receiving the response, convert the strings in the XML back to the native format for consumption.

Steps 1 and 4 may not be mandatory in some scenarios; steps 2 and 3 can almost never be avoided. These steps need to be executed irrespective of the protocol that is being used. These conversions are expensive and should be avoided when one is writing applications in the same technology, and even more so when the applications are hosted in the same data center (read LAN). A sketch of this round trip appears below.
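The following is a minimal sketch, in Java, of the conversions listed above, using JAXB (javax.xml.bind is assumed to be on the classpath). OrderRequest is a hypothetical payload class; the point is only to make the marshalling and unmarshalling cost visible, a cost that a plain in-process Java call avoids entirely.

import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement
class OrderRequest {
    public String productId;
    public int quantity;
}

public class XmlRoundTrip {
    public static void main(String[] args) throws Exception {
        JAXBContext context = JAXBContext.newInstance(OrderRequest.class);

        OrderRequest request = new OrderRequest();
        request.productId = "P-100";
        request.quantity = 3;

        // Step 1 (client): native object converted to an XML string
        StringWriter xml = new StringWriter();
        context.createMarshaller().marshal(request, xml);

        // Step 2 (server): XML string converted back to a native object
        OrderRequest received = (OrderRequest) context.createUnmarshaller()
                .unmarshal(new StringReader(xml.toString()));

        // Steps 3 and 4 repeat the same pair of conversions for the response on the way back.
        System.out.println("Round-tripped: " + received.productId + " x " + received.quantity);
    }
}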

In an extreme case, one of the projects had the following data flow:
1. User from the Desktop Browser makes a request
2. The Application Server converts this into a Web service request and posts it to another server.
3. This Web Service server makes an EJB call to another server.
4. The EJB in turn invokes one or more stored procedures which actually have the business logic, data access and everything else.
5. The data is obviously returned through the same path.
There is no technology other than Java involved in any of the layers above and they are all hosted in the same data center.

Now the same functionality over a mobile has been implemented as follows:
1. User from the Mobile Application makes a request
2. This is sent to the server as a REST request.
3. This REST request from mobile is converted and another internal REST request is made.
4. This internal REST request is converted into a Web service request and posted to another server.
5. This Web Service Server makes an EJB call to another server.
6. The EJB in turn invokes one or more stored procedures which actually have the business logic, data access and everything else.
7. The data is obviously returned through the same path.

Is it any surprise that this application is facing performance, stability, availability and reliability issues?


SQL Performance – 101

1. Fetch only the required columns. Do not use select *. Do not fetch columns from the database only to ignore them in the code.
a. This minimizes the network traffic between the database server and the server on which the data is being processed.
b. It reduces the memory utilization on the database server and on the server where the results are being processed.
2. Fetch only the required rows. Use a proper “where” clause to limit the number of rows retrieved. Do not filter out records in the code.
a. This reduces the disk I/O on the database server.
b. This minimizes the network traffic between the database server and the server on which the data is being processed.
c. This reduces the memory utilization on the database server and on the server where the results are being processed.
d. Use the “LIMIT”/“TOP” clause as available in the database to limit the number of rows retrieved. This comes in very handy for paginating data on the screen, and also when exporting large amounts of data.
3. Join only the minimum required tables. As a rule of thumb, do not join more than three tables.
a. The join of any two tables is conceptually a Cartesian product of the two tables. This means that if we join two tables which have 10,000 records each, at some stage the database will be dealing with 100,000,000, i.e. 100 million, records, and if we join three such tables we will be making the database handle 1,000,000,000,000, i.e. 1 trillion, records. To avoid overloading the database server one should try to join the minimum number of tables.
b. Note that the limit of three has nothing to do with a limit of 1 trillion records. Modern databases use hash joins, which are much more efficient than Cartesian joins, and one will be better off if the database uses these instead. But one cannot simply tell the database to use hash joins instead of Cartesian products; one needs to use the explain plan to determine the joins that the database uses and plan accordingly.
c. Do not hesitate too much to denormalize the tables if it is found that one always ends up joining too many tables.
d. Leverage the memory of the process which is processing the records to cache master data and list-of-values type data, so that joins with the tables that contain such data can be avoided. E.g. a typical table in most applications is a StatusMaster, which holds a list of statuses and their descriptions. To show the data to the user on the screen one may end up joining with this table all the time. This can be avoided by keeping a hash table of the statuses in the memory of the process: when records are fetched only the status id is obtained from the table, and the hash table is looked up to get the description to be shown to the user (a sketch appears after this list).
4. Do not use distinct unless absolutely necessary.
a. When the database finds a distinct clause it has to fetch the required rows, sort them on the columns that have been marked distinct, and then remove the duplicates. If we are sure that the values will be distinct, do not use the keyword distinct. If possible, change the application, and if necessary the database, so that the distinct keyword can be avoided. This is important for tables which have large amounts of data.
5. Do not use order by unless absolutely necessary.
a. If an order by clause is specified then the database needs to fetch the records and then sort them. This is extra overhead for the database, so do not sort unless absolutely necessary.
6. Do not use column functions on a column in the where clause.
a. If a column function is used on a column in the where clause then any index on that column cannot be leveraged, and this will make the query slower. E.g. where toupper(name) = ?. Programmatically ensure that it is not necessary to have a column function in the where clause. If necessary, add an additional column on which the search can be executed.
7. Similarly, do not convert data types from one to another, especially in where clauses. If the column is an integer then compare it with an integer value; do not compare it with a string value even though the database may have the ability to do so. Data type casts/conversions are expensive.
8. Do not use “in” with a sub-query that will return a large number of values in the “in” clause. Instead use an “exists” clause if possible.
a. If the in clause has a large list of values then the database finds it difficult to compare the value from the row with the values in the list. This slows down the query drastically. Use a join to fetch the data in such scenarios.
9. Do not use a query like select … from tableb b1 where columna = (select max(columna) from tableb b2 where b1.columnx = b2.columnx).
a. This kind of query is typically required in tables where we need to pick up the latest record from a history of records. Instead, adopt one of the following strategies:
i. Maintain the history in a separate table so that the main table has only the latest information and the correlated subquery is not required.
ii. Maintain a flag in the table which indicates which is the latest record. When a new version of the record is entered, update the flag in the earlier record to indicate that it is no longer the latest, and set the flag in the new record to indicate that it is. Include the flag column in the appropriate index.
10. Query with integers and numbers wherever possible and avoid querying by strings. It is easier to compare two numbers than to compare two strings.
11. Avoid usage of “like”. Especially avoid wildcards on both sides of a string column, e.g. name like “%smith%”.
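A minimal sketch, in Java, of the in-memory cache suggested in point 3d. The table and column names (STATUS_MASTER, STATUS_ID, DESCRIPTION) are illustrative only; the idea is to load the small master table once at startup and look descriptions up in memory instead of joining to the table in every query.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class StatusCache {
    private final Map<Integer, String> descriptions = new HashMap<>();

    // Load the master data once instead of joining to STATUS_MASTER in every query.
    public StatusCache(Connection connection) throws SQLException {
        try (Statement stmt = connection.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT STATUS_ID, DESCRIPTION FROM STATUS_MASTER")) {
            while (rs.next()) {
                descriptions.put(rs.getInt("STATUS_ID"), rs.getString("DESCRIPTION"));
            }
        }
    }

    // Look up the description for a status id fetched along with the transactional rows.
    public String describe(int statusId) {
        return descriptions.getOrDefault(statusId, "UNKNOWN");
    }
}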

Some Design Considerations/Optimizations

12. To generate values for a primary key column, or for a unique column where the temporal sequence of the numbers does not matter and one can tolerate gaps, use the auto-increment feature that is available in most databases. In the case of Oracle use “sequences” to generate the values for these columns.
a. In Oracle create the sequence with the keywords “cache” and “noorder”. Along with “cache” specify an integer, which is the number of sequence ids that should be cached. This is important for tables where this number needs to be generated very frequently (a sketch appears after this list).
13. Create indexes judiciously. If a table has too many indexes then inserts and updates into the table become slower, as the database now needs to update the indexes too.
14. When creating an index ensure that the cardinality of the values in the indexed column is not too small. “Cardinality” indicates the uniqueness of the values in the column: the more unique the data in the column, the higher its cardinality. Creating an index on a column which has a very low cardinality does not help, as the database will not be able to use the column effectively to fetch rows. E.g. if a column contains only a “Y”/“N” value, or holds a small list of status codes, creating an index on it will not help query optimization.
15. Use the clustered index option only if required. An ideal clustered index should reflect the temporal order in which records are inserted into the table. If this is not the case then inserts and updates can become slow as the data in the table grows.
16. Use the smallest “standard” datatype for the columns. This is important because an unnecessarily large datatype uses up unnecessary memory on the database server every time the records are processed. E.g. try to use INTEGER or SMALLINT for columns which hold integers. Do not use a BIGINT because it is too big, and do not use a byte as it may not be supported on all databases.
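A minimal sketch of point 12, issuing the Oracle-specific DDL through JDBC (the sequence name order_id_seq is illustrative). CACHE 100 pre-allocates a block of ids in memory, and NOORDER allows ids to be handed out without a strict ordering guarantee, which is what makes gaps and out-of-order values possible but keeps generation fast.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SequenceSetup {
    public static void createAndUse(Connection connection) throws SQLException {
        try (Statement stmt = connection.createStatement()) {
            // Cached, no-order sequence: fast id generation, gaps tolerated.
            stmt.execute("CREATE SEQUENCE order_id_seq CACHE 100 NOORDER");
            try (ResultSet rs = stmt.executeQuery("SELECT order_id_seq.NEXTVAL FROM DUAL")) {
                if (rs.next()) {
                    System.out.println("Next id: " + rs.getLong(1));
                }
            }
        }
    }
}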

Summary

To get good performance from the database we need to follow these norms:
1. Keep the design simple.
2. Keep the queries simple and ensure that the database does not have to do anything that is not necessary.
3. Fetch only the data that is required.
4. Normalize, but do not forget to denormalize where required.
In the following code

package com.tcs.ngps.htmlparsers

import org.jsoup.Jsoup
import org.jsoup.nodes.Document

class JSoupParser {

    public JSoupParser() {

    }

    static main(args) {
        // Read the input HTML file passed as the first argument
        String lString = (new File(args[0])).getText()
        // Parse and normalise the HTML with JSoup
        Document lDocument = Jsoup.parse(lString)
        lDocument.outputSettings().prettyPrint(true);
        //(new File("D:/Temp/AWRJSoupParsed.html")).newWriter().write(lDocument.toString())
        // Write the result out, closing the writer so that the buffer is flushed
        def lBufferedWriter = new BufferedWriter(new FileWriter(new File("D:/Temp/AWRJSoupParsed.html")))
        lBufferedWriter.write(lDocument.toString())
        lBufferedWriter.close()
    }
}

The above code reads an HTML file and converts it to an XML file using the JSoup parser. It then writes the XML to a file.
Now if the code
(new File("D:/Temp/AWRJSoupParsed.html")).newWriter().write(lDocument.toString())
is used to write the output, the output file is always smaller than the total length of the string (when the string is large). Only when the following code is used
def lBufferedWriter = new BufferedWriter(new FileWriter(new File("D:/Temp/AWRJSoupParsed.html")))
lBufferedWriter.write(lDocument.toString())
lBufferedWriter.close()

is the whole file written. At first I was unable to figure out why; the likely explanation is that the writer returned by newWriter() is buffered and is never flushed or closed, so the tail of the buffered data never reaches the file, whereas the second version closes the writer explicitly and thereby flushes everything out.
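The same point expressed as a small Java sketch (the file name is reused from the snippet above): a buffered writer only guarantees that everything is on disk once it has been flushed or closed, and try-with-resources makes the close automatic.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class WriteWholeFile {
    public static void main(String[] args) throws IOException {
        String content = "...";  // stand-in for lDocument.toString()
        try (BufferedWriter writer = new BufferedWriter(new FileWriter("D:/Temp/AWRJSoupParsed.html"))) {
            writer.write(content);  // data may still be sitting in the in-memory buffer here
        }                           // the implicit close() flushes the remainder to disk
    }
}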

Beautiful Data: The Stories Behind Elegant Data Solutions (Theory In Practice, #31) by Toby Segaran
My rating: 3 of 5 stars

Beautiful Data Edited by Toby Segaran and Jeff Hammerbacher

Found the Chapters 7 (Data Finds Data), 14 (Natural Language Corpus Data) and 20 (Connecting Data) to be very interesting. The others will be interesting based on one’s areas of interest.

Chapter 1 – Seeing your life in Data
The first chapter is a description of two projects by Yahoo. The first captures the area through which the user is travelling, with the aim of showing the user how much pollution the user has been subjected to. The challenge was to show the exposure levels along with the movements and stops of the user. A variety of techniques were tried before finally arriving at a way of overlaying the exposure as a coloured line and showing the locations where the user stayed for a long time as circles. The colour of the line/circle indicated the exposure level.
The second project involved analyzing the activity tweets of users and giving them a perspective on their behaviour. The challenge in this was to get the user to tweet their activities. This was achieved by asking the user to set a goal and then tweet their activities, so that the application could analyze them and provide hints on which activities were positive and which were negative.

Chapter 2 – The Beautiful People: Keeping users in mind when designing data collection methods
The second chapter is about a project that was carried out to survey older people on the usage of a product that was to be launched. The challenges were: how to survey a population that is not comfortable with computers; how to survey a population whose vision may be deteriorating due to old age; how to induce them to respond to a survey; and how to identify the geography from which they are responding.
To induce the respondents to go through the survey it was kept as small as possible. The questions were displayed such that the users did not get a feeling of having to answer too much or too little. The fonts were selected so that the questions stood out. The questions were worded to avoid all ambiguity. The geographic region was determined by using the IP addresses from which the users responded. In effect the project managed to garner a response far beyond the normal response to any survey.

Chapter 3 – Embedded Image Data Processing on Mars
The third chapter presents the challenges of image processing on board the Mars rovers. Processing power was a challenge, as one worked with hardened CPUs rather than normal CPUs (the CPU clock speed was only 20 MHz). The RAM was also limited to a few MBs, and the only storage was some flash memory. The VxWorks real-time operating system was used as the OS and the code was written in C. The main functionality was to store the images captured by the on-board cameras (each image was 1 megabyte in size), process them for any errors, and downlink them to Earth when connectivity was available.
The final solution involved building a queued system where the image captured by the camera was stored in a section of the flash drive. The image processor picked up this image and cleaned it up. The downlink module then picked this up and sent it to the earth and after successful downlink marked this space as available for the next image. All this happened without the position of the image being shifted in the flash drive.

Chapter 4 – Cloud Storage Design in a PNUTSHELL
The fourth chapter describes the design of PNUTS, a distributed database at Yahoo used for a variety of purposes. The key requirement was to have geographic replication and distribution of the data with minimum latency. Some of the principles that were followed were:
1. Each record had a mastership, which would be the place where it gets updated. It changes only if it is observed that updates are consistently coming from another geographic location over a long period of time.
2. The tables were not modeled in the traditional way but were logically grouped together based on the expected access patterns.
3. The replication order was fixed from one geography to another and was not random. This ensured that the replication traveled the minimum distance.
4. The system provided the application the ability to choose between availability and consistency on a per-table basis. So if the main replica goes down the application can choose to keep the table available and sacrifice consistency, or it can stall the functionality and sacrifice availability till the main partition is back up.
5. Because of the data structure being used, certain complex queries were not possible.

Chapter 5 – Information Platforms and the Rise of the Data Scientist
The fifth chapter talks of the rise of the data scientist: people expected to be able to take a vast amount of data, process it quickly, and make sense of it. The chapter describes how the data about access patterns collected at Facebook was initially kept in MySQL, how it moved to Oracle, and how it finally had to be moved to Hadoop as the data being generated grew in size.

Chapter 6 – The Geographic Beauty of Photographic Archive
The sixth chapter talks of an exercise carried out in England where the users were asked to take photographs of the locality that they move around in and geotag them and provide tags to denote the topography of the terrain. These images were then used to create tree maps which provided insights into the geography of England as well as the differences in the semantics used by the people in the different regions.

Chapter 7 – Data finds Data
The seventh chapter speaks about a very interesting concept of Data find Data. Some of the examples quoted are
1. A guest calls the reception at 8:00 AM asking them to give a wake up call at 12:00 PM. But soon the maid knocks the door for housekeeping. If “data could find data” then the maid’s schedule would have been altered so that the guest is not disturbed.
2. A user searches for a soon to be released book which is not yet available on Amazon. Now if Amazon can keep track of this search and can notify the user when the shipment of books arrives then the user could buy it if interested. If this notification did not go through then the user may come back after a month only to find that the book is sold out.
3. A parent checks a website for the safety of a particular street because her child walks to school along that path. There is nothing to report on the street. Now if the website can keep track of this inquiry and, if an incident occurs, notify the parent of the incident, the parent would be alerted. Otherwise the parent may never notice till the next time she comes and checks the site.
4. Government departments do not generally share data amongst each other, given the security and privacy requirements. This prevents identification of patterns. E.g. the same person being involved in terrorism and narcotics could show up if these departments used the “data finds data” principle and looked up each other’s databases, possibly in a restricted, secure fashion.
A federated search across the various data stores is one way to achieve this, but it can be a tedious and long-drawn process considering the amount of data that one needs to parse. The way to achieve this would be to extract and classify the data and then act on it. A system that needs to implement this feature needs to have the following building blocks:
1. The existence of, and availability of, observations
2. The ability to extract and classify features from the observations
3. The ability to efficiently discover related historical context
4. The ability to make assertions (same or related) about new observations
5. The ability to recognize when new observations reverse earlier assertions.
6. The ability to accumulate and persist this asserted context
7. The ability to recognize the formation of relevance/insight
8. The ability to notify the appropriate entity of such insight.

Chapter 8 – Portable Data in Real Time
The eighth chapter is about how Gnip uses an event-driven mechanism, rather than a polling mechanism, to gather social data, “clean” and “normalize” it, and provide it to subscribers. The key takeaway is how bad polling for data is compared to having the data pushed to the subscribers.

Chapter 9 – Surfacing the Deep Web
The ninth chapter is about how search engines only scan the static contents of websites and fail to scan “deep”. E.g. consider scanning a second-hand car website for all models available. Typically this will involve the user providing some inputs, like the state in which she is looking for second-hand cars, the models she is interested in, and the price range, and then clicking a button to get the list of matching cars. A normal search engine bot will not do any of this. All it will indicate to somebody searching for second-hand cars is that here is a site that sells second-hand cars. If instead it were able to dig “deep” into the site by simulating a real user and index these pages too, then it would be possible for the search engine to provide more details. But this is not easy to achieve, as one needs to think about what parameter values to search with. The permutations and combinations of parameter values are too many, and most may not make sense. The chapter describes an approach in which the site is first probed with generic values like “*” or blanks for the parameters; the output of this is parsed to gauge the actual parameter values that can be passed to get meaningful data. The idea is not to get every single piece of data stored behind the scenes, but to try to cover most types of data that are hidden and index them so that the search engine is enriched.

Chapter 10 – Building Radiohead’s House of Cards
The tenth chapter is about how Radiohead’s music video “House of Cards” was made without any camera, using only data captured by two main pieces of equipment which measured the amount of light reflected back by the environment in which they were operated. The data is captured as a set of XY coordinates and the intensity of each point. This data is available at http://code.google.com/radiohead for people to play around with and create their own videos and effects.

Chapter 11 – Visualizing Urban Data
The eleventh chapter is about how a website was built for the residents of Oakland, California by scraping data from the police department and displaying it on a map, making it easier for the residents to track crime in the streets of Oakland. Similar to the first chapter, this involves displaying the crimes on a map, and the challenges are similar to those seen in the first chapter. The site was briefly shut down when the police department figured out what these people were doing and stopped access to its website. The scraping stopped only when the police department finally relented and started giving them data at the end of each day in an Excel sheet.

Chapter 12 – The Design of Sense.us
The twelfth chapter is a project that took the US census data for the 100+ years and built visualizations around it. Some interesting concepts were:
1. Usage of stacked line charts to indicate different percentages, e.g. the percentage of the population engaged in a particular field of work, or the percentage of the population by place of birth. These show the trends over a period of time and can be further drilled down into as another set of stacked line charts.
2. A population pyramid was a chart where the left side represented the statistics of the male population and the right side showed those of the female population. The Y axis was the age group, and a year slider allowed the user to visualize the change in the population by age group over a period of time.
3. Doubly linked discussion. The project gave users the ability to annotate the views. The annotations were linked to the current view the user was seeing, and users could navigate from the comments to the views and from the views to the comments. Hence doubly linked.
4. A provision was given to annotate the graphs graphically, so users could overlay the view with lines, circles, etc. and share these with others.
5. There was also a feature by which views could be collected by the user for later viewing or for sharing with others. This involved the user simply indicating that she wished to “Add View” to her graphical list of bookmarks.
This project was not released to the outside world. Instead the technology was used by IBM to build the site many-eyes.com, where users can upload their data and visualize it using visualizations that they create themselves.

Chapter 13 – What Data Doesn’t Do
This chapter talks about scenarios which cannot be detected using data. Somethings which seem trivial to the eye will not be easy to figure out using data. E.g. presence of an ugly duckling amongst a set of swans in an image can be immediately spotted by the eye, but detecting the same with the data that represents the image is not easy.
Similarly, reading the line “Iamnotgoingtocomehometoday” is easy for the eye, but not easy for the computer.
Similarly there are scenarios where the computer will be able to figure out trends more easily than the human eye can. E.g. from a complex scatter plot it will be difficult for the eye to detect a trend whereas using mathematically techniques a computer would be able to find a suitable trend.
Sometimes one can be misled by a biased “narrative”. This is called the “narrative fallacy”. E.g. if one is shown a set of graphs and told that they represent the stock prices of three companies in the manufacturing industry, and is then asked to pick the stock that will perform well in the coming days, one will tend to guess based on the data provided. Humans tend to build a story around the data to support their conclusions.
The tendency to apply a past conclusion to the present analysis is called “confirmation bias”.
The author states that data does not necessarily drive one in the right direction because
1. Our tools for using data are inexact.
2. We process data with known biases.
The author also makes a set of statements:
1. More data isn’t always better: the argument is that this applies well for data that has a normal distribution, but not all data is normally distributed, and so it does not necessarily apply in all scenarios.
2. More data isn’t always easy: capturing, storing and processing large amounts of data is not easily done, even given the advancements we have seen in processors.
3. Data alone doesn’t explain: the author argues that “given two variables correlated in a statistically significant way, causality can work forward, backward, in both directions or not at all”. The author cites an example of how an article in the Wall Street Journal suggested that since “premarital cohabitation is correlated with higher rates of divorce, unwed couples could avoid living together in order to improve their chances of staying together after marriage”. A very skewed conclusion at best.
4. Data isn’t good for a single answer: analysis of data does not lead to a single conclusion most of the time. It usually points to several possible conclusions.
5. Data doesn’t predict: in a controlled environment it is possible to predict an outcome with near certainty, but in domains with less certainty, such as human or physical behaviour, modeling is an important tool to help explain patterns, and in one’s eagerness one can tend to overfit a model.
6. Probability isn’t intuitive: the author cites an example of how probability is not always intuitive, and states that “when using data to answer a question, we don’t know what evidence to exclude and how to weigh what we include”.
7. Probabilities aren’t intuitive: the author states that when we are dealing with multiple probabilities, it becomes even trickier and one tends to get biased by prior experience.
8. The real world doesn’t create random variables: sometimes one can get carried away by statistics, forgetting that statistics is not a law of nature. This can lead to very wrong conclusions. In the real world there is a lot of interconnection, and the data observed is not a set of random independent values.
9. Data doesn’t stand alone: it is not easy to make a decision based only on data. E.g. when one has to decide whether to give a loan to a person, it is not only the financial credentials that influence the decision; factors like the social background of the applicant and the approver also influence the outcome.
10. Data isn’t free from the eye of the beholder: The same data viewed and analyzed by different persons can lead to different conclusions because of their personal cognitive biases.

Chapter 14 – Natural Language Corpus Data
This chapter talks about how the corpus of natural language words (tokens) that has been accumulated by Google over the years can be effectively used to improve natural language text interpretation. One example that the author illustrates is “word segmentation”. This involves interpreting a phrase like choosespain.com. Does this mean “choose spain” or “chooses pain”? Based on the frequency of these two terms appearing together in Google’s collection of phrases, it is possible to say with reasonable certainty that the phrase is “choose spain”, as the number of occurrences of the phrase “choose spain” is 3120 whereas there are no occurrences of the phrase “chooses pain” in the collection.
But if one considers a phrase like “insufficientnumbers”, it becomes difficult to determine whether this means “in sufficient numbers” or “insufficient numbers”. A human may be able to make it out from the context, but for a computer this is difficult, especially given that the number of occurrences of “in sufficient numbers” is 32378 and the number of occurrences of “insufficient numbers” is 20751.
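A minimal sketch, in Java, of the frequency-based idea described above. The two counts are the ones quoted in the chapter summary; the tiny lookup table is obviously a toy stand-in for Google’s corpus data, and a real segmenter would also combine probabilities across all possible splits.

import java.util.Map;

public class SegmentationDemo {

    // Toy phrase-count table standing in for the corpus counts.
    private static final Map<String, Long> PHRASE_COUNTS = Map.of(
            "choose spain", 3120L,
            "chooses pain", 0L);

    // Pick whichever candidate segmentation occurs more often in the corpus.
    static String pickSegmentation(String first, String second) {
        long firstCount = PHRASE_COUNTS.getOrDefault(first, 0L);
        long secondCount = PHRASE_COUNTS.getOrDefault(second, 0L);
        return firstCount >= secondCount ? first : second;
    }

    public static void main(String[] args) {
        System.out.println(pickSegmentation("choose spain", "chooses pain"));  // prints "choose spain"
    }
}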
The second topic that the author talks about is how to use this data to decipher secret codes and the third topic that is discussed is correction of spellings. The author mentions that the other interesting applications would be
1. Language Identification
2. Spam Detection and Other Classification Tasks
3. Author Identification
4. Document Unshredding and DNA Sequencing
5. Machine Translation

Chapter 15 – Life in Data: The story of DNA
In this chapter the author describes how the sequencing of the billions and billions of DNA base pairs was achieved.

Chapter 16 – Beautifying Data in the Real World
This chapter starts off with the premise that there are two fundamental problems with collecting “beautiful data”
1. The universe is inherently noisy, and so one will tend to get different readings of the same experiment under slightly varying circumstances.
2. The second problem is one of space limitations. The raw data on which conclusions are based tends to be too large. The question that arises is: how does one present the raw data so that somebody looking at the conclusions can validate their relevance?

The author goes on to describe the different techniques that were used to minimize these problems when collecting data on the behaviour of different chemicals.

Chapter 17 – Superficial Data Analysis: Exploring Millions of Social Stereotypes
This chapter talks about the site facestat.com, which allows users to post their photos and ask the world to comment on their appearance. When sufficient data had been collected it was used to analyze stereotypes. One of the standout observations was that people tended to rate women as more beautiful than men, and that the majority of children were considered cute or beautiful as opposed to any other age group.

Chapter 18 – Bay Area Blues: The effect of housing crisis
This chapter talks about how data on house sales in California over the period 2000 to 2010 was analyzed to try to figure out the impact of the housing loan crisis on house prices. The data was analyzed from various perspectives and provided different insights.

Chapter 19 – Beautiful Political Data
This chapter explains how election data from different elections in the US was taken and analyzed to yield some very interesting insights into how the people tended to vote.

Chapter 20 – Connecting Data
This chapter talks about how similar/same data obtained from different sources can be linked to find the right matches. It talks about how data stored in graph form can be identified as the same by trying to approach the data from different directions. E.g. to identify that two movies are the same, one can try to reach the movie through the actors, through the director, and through the year of release; if all, or a majority, of these paths lead to the two nodes from the different sources, then it is very likely that the two nodes represent the same movie.


