Sundarrajk's Weblog

Archive for the ‘Statistics’ Category

Beautiful Data: The Stories Behind Elegant Data Solutions (Theory In Practice, #31) by Toby Segaran
My rating: 3 of 5 stars

Beautiful Data Edited by Toby Segaran and Jeff Hammerbacher

I found Chapters 7 (Data Finds Data), 14 (Natural Language Corpus Data) and 20 (Connecting Data) very interesting. How interesting the others are will depend on one’s areas of interest.

Chapter 1 – Seeing your life in Data
The first chapter describes two projects by Yahoo. The first one tracks the areas through which the user travels, with the aim of showing the user how much pollution they have been exposed to. The challenge was to show the exposure levels along with the user’s movements and stops. A variety of techniques were tried before finally arriving at a way of overlaying the exposure as a coloured line and showing the locations where the user stayed for a long time as circles. The colour of the line or circle indicates the exposure level.
The second project involved analyzing the activity tweets of users to give them a perspective on their behaviour. The challenge here was getting users to tweet their activities. This was achieved by asking each user to set a goal and then tweet, so that the application could analyze the user’s activities and provide hints on which activities were positive and which were negative.

Chapter 2 – The Beautiful People: Keeping users in mind when designing data collection methods
The second chapter describes a project to survey elderly people about their usage of a product that was about to be launched. The challenges were: how to survey a population that is not comfortable with computers; how to survey a population whose vision may be deteriorating due to old age; how to induce them to respond to a survey; and how to identify the geography from which they are responding.
To induce the respondents to complete the survey, it was kept as short as possible. The questions were displayed so that the users did not feel they had to answer too much or too little. The fonts were selected so that each question stood out. The questions were worded to avoid all ambiguity. The geographic region was determined from the IP addresses from which the users responded. In effect, the project managed to garner a response far beyond the normal response to any survey.

Chapter 3 – Embedded Image Data Processing on Mars
The third chapter presents the challenges of image processing on board the Mars rover. Processing power was a constraint, as one worked with toughened CPUs rather than normal CPUs (the CPU clock speed was only 20 MHz). The RAM was also limited to a few MB. The only storage was some flash memory. The VxWorks real-time operating system was used as the OS, and the code was written in C. The main functionality was to store the images captured by the on-board cameras (each image was 1 megabyte in size), process them for any errors, and downlink them to Earth when connectivity was available.
The final solution involved building a queued system in which the image captured by the camera was stored in a section of the flash drive. The image processor picked up this image and cleaned it up. The downlink module then picked it up and sent it to Earth and, after a successful downlink, marked the space as available for the next image. All this happened without the position of the image being shifted in the flash drive.
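The queue described above can be pictured as a fixed set of slots whose state changes while the image stays put. What follows is only a minimal sketch of that idea in Python (the rover code was in C); the class, the state names and the lowest-index-first scan are all my assumptions, not something taken from the chapter.

```python
from enum import Enum


class SlotState(Enum):
    FREE = "free"            # slot can receive a new image
    CAPTURED = "captured"    # raw image written, not yet cleaned
    PROCESSED = "processed"  # cleaned image waiting for downlink


class ImageStore:
    """Hypothetical fixed-slot store: an image never moves once written."""

    def __init__(self, n_slots):
        self.states = [SlotState.FREE] * n_slots
        self.data = [None] * n_slots

    def _first(self, state):
        # Return the lowest-numbered slot in the given state, or None.
        for i, s in enumerate(self.states):
            if s is state:
                return i
        return None

    def capture(self, image):
        i = self._first(SlotState.FREE)
        if i is None:
            return None  # store is full, the camera has to wait
        self.data[i] = image
        self.states[i] = SlotState.CAPTURED
        return i

    def process_next(self, clean):
        i = self._first(SlotState.CAPTURED)
        if i is None:
            return None
        self.data[i] = clean(self.data[i])  # cleaned in the same slot
        self.states[i] = SlotState.PROCESSED
        return i

    def downlink_next(self, send):
        i = self._first(SlotState.PROCESSED)
        if i is None:
            return None
        if send(self.data[i]):  # free the slot only after a successful send
            self.data[i] = None
            self.states[i] = SlotState.FREE
        return i
```

Each stage only flips the state of a slot, so nothing is ever copied from one place to another. A real system would also have to preserve capture order and survive a reboot, which this sketch ignores.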

Chapter 4 – Cloud Storage Design in a PNUTSHELL
The fourth chapter describes the design of PNUTS, a distributed database at Yahoo that is used for a variety of purposes. The key requirement was geographic replication and distribution of the data with minimum latency. Some of the principles that were followed were:
1. Each record had a mastership, which is the place where it gets updated. The mastership changes only if it is observed that the updates are consistently coming from another geographic location over a long period of time.
2. The tables were not modeled in the traditional way but were logically grouped together based on the expected access.
3. The Replication order was fixed from one geography to another and was not random. This ensured that the replication traveled minimum distance.
4. The system provided the application the ability to choose between availability and consistency on a per table basis. So if the main replica goes down the application can choose to make the table available and choose to sacrifice consistency or it can stall the functionality and choose to sacrifice availability till the main partition is up.
5. Because of the data structure being used certain complex queries were not possible.
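The first principle, per-record mastership that moves only when writes keep coming from elsewhere, can be illustrated with a small sketch. This is not how PNUTS is implemented; the `Record` class, the region names and the use of a count of consecutive writes (`migrate_after`) as a stand-in for "a long period of time" are all assumptions made for the illustration.

```python
class Record:
    """Hypothetical record whose master region follows its writers."""

    def __init__(self, value, master_region, migrate_after=3):
        self.value = value
        self.master_region = master_region
        self.migrate_after = migrate_after  # assumed threshold, not from the book
        self._streak_region = None
        self._streak = 0

    def write(self, value, origin_region):
        """Apply a write and return the master region after it."""
        self.value = value
        if origin_region == self.master_region:
            # A write from the current master breaks any foreign streak.
            self._streak_region, self._streak = None, 0
        elif origin_region == self._streak_region:
            self._streak += 1
        else:
            self._streak_region, self._streak = origin_region, 1
        if self._streak >= self.migrate_after:
            # Writes have come consistently from elsewhere: move mastership.
            self.master_region = self._streak_region
            self._streak_region, self._streak = None, 0
        return self.master_region
```

With the default threshold, two writes from another region leave the mastership where it is, and only an unbroken run of three moves it.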

Chapter 5 – Information Platforms and the Rise of the Data Scientist
The fifth chapter talks of the rise of the Data Scientist. These are people expected to be able to take a vast amount of data, process it quickly and make sense of it. The chapter describes how the data about access patterns being collected at Facebook was initially set up in MySQL, how it moved to Oracle, and how it finally had to be moved to Hadoop as the data being generated grew in size.

Chapter 6 – The Geographic Beauty of Photographic Archive
The sixth chapter talks of an exercise carried out in England in which users were asked to photograph the localities they move around in, geotag the photographs, and add tags describing the topography of the terrain. These images were then used to create tree maps, which provided insights into the geography of England as well as into the differences in the semantics used by people in different regions.

Chapter 7 – Data finds Data
The seventh chapter speaks about a very interesting concept: Data finds Data. Some of the examples quoted are:
1. A guest calls the reception at 8:00 AM asking for a wake-up call at 12:00 PM. But soon the maid knocks on the door for housekeeping. If “data could find data” then the maid’s schedule would have been altered so that the guest is not disturbed.
2. A user searches for a soon-to-be-released book which is not yet available on Amazon. If Amazon keeps track of this search and notifies the user when the shipment of books arrives, the user can buy it if interested. Without this notification the user may come back after a month only to find that the book is sold out.
3. A parent checks a website for the safety of a particular street because her child walks to school along it. There is nothing to report on the street. If the website keeps track of this inquiry and notifies the parent when an incident occurs, the parent is alerted. Otherwise the parent may never notice until the next time she comes and checks the site.
4. Government departments do not generally share data with each other, given security and privacy requirements. This prevents the identification of patterns. E.g. the same person being involved in terrorism and narcotics could show up if these departments used the “Data finds Data” principle and looked up each other’s databases, possibly in a restricted, secure fashion.
A federated search across the various data stores is one way to achieve this, but it can be a tedious and long-drawn process considering the amount of data that one needs to parse. The way to achieve this would be to extract and classify the data and then act on it. A system that implements this feature needs to have the following building blocks:
1. The existence and availability of observations
2. The ability to extract and classify features from the observations
3. The ability to efficiently discover related historical context
4. The ability to make assertions (same or related) about new observations
5. The ability to recognize when new observations reverse earlier assertions.
6. The ability to accumulate and persist this asserted context
7. The ability to recognize the formation of relevance/insight
8. The ability to notify the appropriate entity of such insight.
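To make the parent-and-street example concrete, here is a minimal sketch of the idea that a query is itself an observation worth keeping, so that a later observation can “find” it. It covers only a sliver of the building blocks above (persisting context and notifying), and the class and method names are hypothetical, not taken from the chapter.

```python
from collections import defaultdict


class InterestRegistry:
    """Remembers who asked about what, so later observations can find them."""

    def __init__(self):
        self._interests = defaultdict(set)  # feature -> people who asked
        self.outbox = []                    # (person, message) pairs to deliver

    def record_query(self, person, feature):
        # Persist the query instead of throwing it away after answering it.
        self._interests[feature].add(person)

    def record_observation(self, feature, message):
        # A new observation looks up earlier interest in the same feature
        # and queues a notification for everyone who asked.
        notified = sorted(self._interests.get(feature, ()))
        for person in notified:
            self.outbox.append((person, message))
        return notified
```

A parent who once asked about a street gets a message the moment an incident on that street is recorded, without having to come back and search again. Matching here is by exact feature name; extracting and classifying features from raw observations is the hard part and is left out.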

Chapter 8 – Portable Data in Real Time
The eighth chapter is about how Gnip uses an event-driven mechanism, rather than a polling mechanism, to gather social data, “clean” and “normalize” it, and provide it to subscribers. The key takeaway is how bad polling for data is compared to having data pushed to the subscribers.

Chapter 9 – Surfacing the Deep Web
The ninth chapter is about how search engines scanned only the static content of websites and failed to scan “deep”. Take a second-hand car website that lists all the models available. Typically the user provides some inputs, such as the state in which she is looking for second-hand cars, the models she is interested in and her price range, and then clicks a button to get the list of matching cars. A normal search engine bot will not do any of this. All it can tell somebody searching for second-hand cars is that here is a site that sells them. If instead it were able to dig “deep” into the site by simulating a real user and index these pages too, the search engine could provide more details. But this is not easy to achieve, as one needs to decide which parameter values to search with. The permutations and combinations of parameter values are too many, and most may not make sense. The chapter describes an approach that first probes the site with generic values like “*” or blank for the parameters, then parses the output to gauge the actual parameter values that can be passed to get meaningful data. The idea is not to retrieve every single piece of data stored behind the scenes, but to cover most types of hidden data and index them so that the search engine is enriched.

Chapter 10 – Building Radiohead’s House of Cards
The tenth chapter is about how Radiohead’s music video “House of Cards” was made without any camera, using only data captured by two main pieces of equipment which measured the amount of light reflected back by the environment in which they operated. The whole data set is captured as a set of XY coordinates and the intensity at each point. The data has been made available for people to play around with and create their own videos and effects.

Chapter 11 – Visualizing Urban Data
The eleventh chapter is about how a website was built for the residents of Oakland, California by scraping data from the police department and displaying it on a map, making it easier for the residents to track crime in the streets of Oakland. Similar to the first chapter, this involves displaying data on a map, and the challenges are similar to those seen there. The site was briefly shut down when the police department figured out what these people were doing and blocked access to its website. The scraping stopped when the police department finally relented and started giving them the data at the end of each day in an Excel sheet.

Chapter 12 – The Design of
The twelfth chapter describes a project that took US census data spanning 100+ years and built visualizations around it. Some interesting concepts were:
1. Usage of stacked line charts to indicate different percentages, e.g. the percentage of the population engaged in a particular field of work, or the percentage of the population by place of birth. These show the trends over a period of time. They can be drilled down further into another set of stacked line charts.
2. A population pyramid, a chart where the left side represents the statistics of the male population and the right side those of the female population. The Y axis is the age group, and a year slider allows the user to visualize the change in the population by age group over a period of time.
3. Doubly linked discussion. The project allowed users to annotate the views. The annotations were linked to the view the user was seeing at the time. Users could navigate from the comments to the views and from the views to the comments, hence doubly linked.
4. A provision to annotate the graphs graphically. Users could overlay the view with lines, circles etc. and share these with others.
5. A feature by which views could be collected by the user for later viewing or for sharing with others. This involved the user just indicating that she wished to “Add View” to her graphical list of bookmarks.
This project was not released to the outside world. Instead the technology was used by IBM to build a site where users can create their own visualizations, upload their data and view that data using the visualizations they created.

Chapter 13 – What Data Doesn’t Do
This chapter talks about scenarios which cannot be detected using data. Some things which seem trivial to the eye are not easy to figure out using data. E.g. the presence of an ugly duckling amongst a set of swans in an image can be spotted immediately by the eye, but detecting the same from the data that represents the image is not easy.
Similarly, reading the line “Iamnotgoingtocomehometoday” is easy for the eye, but not easy for the computer.
Conversely, there are scenarios where the computer can figure out trends more easily than the human eye can. E.g. in a complex scatter plot it is difficult for the eye to detect a trend, whereas using mathematical techniques a computer is able to find a suitable trend.
Sometimes one can be misled by a biased “narrative”. This is called the “narrative fallacy”. E.g. if one is shown a set of graphs and told that these represent the stock prices of three companies in the manufacturing industry, and is asked to pick the stock that will perform well in the coming days, one would tend to guess based on the data provided. Humans tend to build a story around the data to support their conclusions.
The tendency to apply a past conclusion to the present analysis is called “confirmation bias”.
The author states that data does not necessarily drive one in the right direction because
1. Our tools for using data are inexact.
2. We process data with known biases.
The author also makes a set of statements:
1. More data isn’t always better: The argument is that this applies well for data that has a normal distribution, but not all data show normal distribution and so it does not necessarily apply in all scenarios.
2. More data isn’t always easy: Capturing, storing and processing large amounts of data is not easily done, even given the advancement that we have seen in processors.
3. Data alone doesn’t explain: The author argues that “given two variables correlated in a statistically significant way, causality can work forward, backward, in both directions or not at all”. The author cites an example of how an article in the Wall Street Journal had suggested that since “premarital cohabitation is correlated with higher rates of divorce, unwed couples could avoid living together in order to improve their chances of staying together after marriage”. A very skewed conclusion at best.
4. Data isn’t good for a single answer: Analysis of data does not lead to a single conclusion most of the times. It usually points to possible conclusions.
5. Data doesn’t predict: In a controlled environment it is possible to predict an outcome with near certainty. But in domains with less certainty, such as human or physical behaviour, modeling is an important tool to help explain patterns, and in one’s eagerness one can tend to overfit a model.
6. Probability isn’t intuitive: The author cites an example of how probability is not always intuitive. The author states “when using data to answer a question, we don’t know what evidence to exclude and how to weigh what we include”.
7. Probabilities aren’t intuitive: The author states that when we are dealing with multiple probabilities it becomes even trickier, and one tends to get biased by prior experience.
8. The real world doesn’t create random variables: Sometimes one can get carried away by statistics, forgetting that statistics are not laws of nature. This can lead to very wrong conclusions. In the real world there is a lot of interconnection, and the data observed are not random, independent values.
9. Data doesn’t stand alone: It is not easy to make a decision based on data alone. E.g. when one has to decide whether or not to give a loan to a person, it is not only the financial credentials that influence the decision; factors like the social background of the applicant and the approver also influence the outcome.
10. Data isn’t free from the eye of the beholder: The same data viewed and analyzed by different persons can lead to different conclusions because of their personal cognitive biases.

Chapter 14 – Natural Language Corpus Data
This chapter talks about how the corpus of natural language words (tokens) that has been accumulated by Google over the years can be used effectively to improve natural language text interpretation. One example that the author illustrates is “Word Segmentation”. This involves interpreting a phrase like “choosespain”. Does this mean “choose spain” or “chooses pain”? Based on the frequency with which these two terms occur together in the collection of phrases at Google, it is possible to say with reasonable certainty that the phrase is “choose spain”, as the phrase “choose spain” occurs 3120 times whereas there are no occurrences of the phrase “chooses pain” in the collection.
But if one considers a phrase like “insufficientnumbers” it becomes difficult to determine whether this means “in sufficient numbers” or “insufficient numbers”. A human may be able to make it out based on the context, but this is difficult for a computer, especially given that “in sufficient numbers” occurs 32378 times and “insufficient numbers” occurs 20751 times.
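The two examples above can be replayed with a toy sketch: enumerate every way of splitting the string into known words and rank the candidates by how often each phrase was seen. The four counts are the ones quoted in this review; the tiny vocabulary and the function names are my own assumptions, and the chapter itself works with a probabilistic model over a vastly larger corpus rather than a direct lookup like this.

```python
# Phrase counts as quoted in the review above.
PHRASE_COUNTS = {
    "choose spain": 3120,
    "chooses pain": 0,
    "in sufficient numbers": 32378,
    "insufficient numbers": 20751,
}

# Toy vocabulary: just the words that appear in those phrases.
VOCAB = {word for phrase in PHRASE_COUNTS for word in phrase.split()}


def segmentations(text):
    """Yield every split of text into a list of words drawn from VOCAB."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in VOCAB:
            for rest in segmentations(text[i:]):
                yield [head] + rest


def ranked(text):
    """Return candidate phrases for text, most frequently seen first."""
    candidates = [" ".join(words) for words in segmentations(text)]
    return sorted(candidates, key=lambda p: PHRASE_COUNTS.get(p, 0), reverse=True)


print(ranked("choosespain"))
print(ranked("insufficientnumbers"))
```

For “choosespain” the counts separate the candidates cleanly. For “insufficientnumbers” both candidates have counts of the same order, so ranking by frequency alone just picks “in sufficient numbers” and cannot tell which one the writer meant.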
The second topic that the author talks about is how to use this data to decipher secret codes and the third topic that is discussed is correction of spellings. The author mentions that the other interesting applications would be
1. Language Identification
2. Spam Detection and Other Classification Tasks
3. Author Identification
4. Document Unshredding and DNA Sequencing
5. Machine Translation

Chapter 15 – Life in Data: The story of DNA
In this chapter the author describes how the sequencing of DNA, with its billions and billions of possible sequences, was achieved.

Chapter 16 – Beautifying Data in the Real World
This chapter starts off with the premise that there are two fundamental problems with collecting “beautiful data”
1. The universe is inherently noisy, so one will tend to get different readings from the same experiment under slightly varying circumstances.
2. The second problem is of space limitations. Raw data based on which conclusions are drawn tend to be too large. The question that arises is how does one present the raw data so that somebody looking at the conclusions drawn can validate the relevance of the conclusion?

The author goes on to describe the different techniques that were used to minimize these problems when collecting data on the chemical behaviour of different chemicals.

Chapter 17 – Superficial Data Analysis: Exploring Millions of Social Stereotypes
This chapter talks about a site that allows users to post their photos and ask the world to comment on their appearance. When sufficient data had been collected, it was used to analyze stereotypes. One of the standout observations was that people tended to rate women as more beautiful than men. The majority of children were considered to be cute or beautiful, unlike any other age group.

Chapter 18 – Bay Area Blues: The effect of housing crisis
This chapter talks about how data on the sale of houses in California over the period from 2000 to 2010 was analyzed to try and figure out the impact of the housing loan crisis on the price of houses. The data was analyzed from various perspectives, each of which provided different insights.

Chapter 19 – Beautiful Political Data
This chapter explains how election data from different elections in the US was taken and analyzed to yield some very interesting insights into how the people tended to vote.

Chapter 20 – Connecting Data
This chapter talks about how similar or identical data obtained from different sources can be linked to find the right matches. It describes how data stored in graph form can be identified as the same by approaching it from different directions. E.g. to identify that two movies are the same, one can try to reach the movie through the actors, through the director and through the year of release, and if all of them, or the majority of them, lead to two different nodes from different sources then it is very


After the threefold increase of salaries of MPs it disheartens one to see a report like “After hike, Indian MP gets 13 times less pay than US lawmaker!”.

How can one compare the pay structure of somebody in the US with that of somebody in India? Can one not do a more sensible calculation by taking into account the relative per capita income in the two countries? The per capita income of the US is 37,500 USD. In India it is 2,880 USD. Now what does the writer of the article wish to compare?
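The more sensible calculation is a one-liner. The sketch below uses only the two per capita figures quoted above, taken at face value; the function name is mine.

```python
# Per capita income figures exactly as quoted in this post (USD).
US_PER_CAPITA_USD = 37_500
INDIA_PER_CAPITA_USD = 2_880


def income_ratio(us=US_PER_CAPITA_USD, india=INDIA_PER_CAPITA_USD):
    """How many times larger the US per capita income is than India's."""
    return us / india


print(f"US per capita income is about {income_ratio():.1f} times India's")
# prints: US per capita income is about 13.0 times India's
```

On these figures the ratio comes to about 13, so the 13-fold pay gap in the headline is roughly what the difference in per capita income alone would produce.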

How much can one, and how much will one, twist statistics to suit one’s need?

All the article writer wanted was some attention and he is getting plenty of it despite being downright stupid.