{"id":1639,"date":"2013-08-16T16:41:02","date_gmt":"2013-08-16T21:41:02","guid":{"rendered":"http:\/\/bardagjy.com\/?p=1639"},"modified":"2013-08-17T12:25:34","modified_gmt":"2013-08-17T17:25:34","slug":"colors-of-the-internet","status":"publish","type":"post","link":"https:\/\/bardagjy.com\/?p=1639","title":{"rendered":"Colors of the Internet"},"content":{"rendered":"<div id=\"attachment_1649\" style=\"width: 620px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/webcolors100.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1649\" src=\"http:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/webcolors100-610x482.png\" alt=\"Andy Bardagjy, Constituent colors of top 100 websites on the internet (August 4, 2013).\" width=\"610\" height=\"482\" class=\"size-large wp-image-1649\" srcset=\"https:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/webcolors100-610x482.png 610w, https:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/webcolors100-300x237.png 300w, https:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/webcolors100-299x236.png 299w, https:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/webcolors100.png 2010w\" sizes=\"(max-width: 610px) 100vw, 610px\" \/><\/a><p id=\"caption-attachment-1649\" class=\"wp-caption-text\">Andy Bardagjy, Constituent colors of top 100 websites on the internet. August 4, 2013.<\/p><\/div>\n<p>Allison and I had been thinking about decomposing scenes, art, and geometries into representative colors, textures, and features. Then, during an inspiring walk around the <a href=\"http:\/\/www.moma.org\/\">MOMA<\/a>, I spotted a woodblock print (below) by <a href=\"https:\/\/en.wikipedia.org\/wiki\/Sherrie_Levine\">Sherrie Levine<\/a>. In her prints <a href=\"http:\/\/www.artnotart.com\/sherrielevine\/arts.04.90.html\">Meltdown<\/a>, she decomposes paintings by Duchamp, Kirchner, Mondrian, and Monet into their constituent colors. 
Can you guess which color-set corresponds to which artist?<\/p>\n<div id=\"attachment_1647\" style=\"width: 620px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/Sherrie_Levine_Meltdown_All.jpeg.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1647\" src=\"http:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/Sherrie_Levine_Meltdown_All.jpeg-610x210.jpg\" alt=\"Sherrie Levine, Meltdown. 1989.\" width=\"610\" height=\"210\" class=\"size-large wp-image-1647\" srcset=\"https:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/Sherrie_Levine_Meltdown_All.jpeg-610x210.jpg 610w, https:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/Sherrie_Levine_Meltdown_All.jpeg-300x103.jpg 300w, https:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/Sherrie_Levine_Meltdown_All.jpeg-299x103.jpg 299w, https:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/Sherrie_Levine_Meltdown_All.jpeg.jpg 1279w\" sizes=\"(max-width: 610px) 100vw, 610px\" \/><\/a><p id=\"caption-attachment-1647\" class=\"wp-caption-text\">Sherrie Levine, Meltdown. 1989.<\/p><\/div>\n<p>I&#8217;m interested in similar studies. I want to leverage the internet to understand composition, colors, and themes of designs and art which influence our society. This project builds on those ideas by exploring the colors which make up the most visited sites on the internet.<\/p>\n<h2>How it Works<\/h2>\n<p>These days I&#8217;ve been writing a lot of Python (2.7) and figured this would be a nice project to hack on during <a href=\"http:\/\/pydata.org\/bos2013\/\">PyData<\/a>. In the script <code>webcolors.py<\/code>, I first grab a list of the top 1 million most visited websites generously compiled by <a href=\"http:\/\/www.alexa.com\/topsites\">Alexa<\/a>. Then, for each of those sites, I compute their constituent colors and save them to a big JSON document (I usually use a database for this, but wanted to play with JSON). 
In a separate Python script, I use <a href=\"http:\/\/matplotlib.org\/\">matplotlib<\/a> to plot the colors.<\/p>\n<p>Initially I expected this to be an exercise in text parsing. I would pull down the source files that are rendered into the webpage, search them for color tags, and generate a visualization. I quickly realized that approach was futile. Most of the top websites rely heavily on Flash, JavaScript, and other dynamic technologies. To deduce the colors of a modern website, I&#8217;d have to render it &#8211; effectively building an entire web browser. Not to mention, many websites are dominated by large images, which text-based analysis cannot capture at all.<\/p>\n<p>To find the colors of a webpage, I would have to render the page in a browser. I chose <a href=\"https:\/\/code.google.com\/p\/chromedriver\/\">Chrome<\/a> driven through the <a href=\"http:\/\/docs.seleniumhq.org\/projects\/webdriver\/\">Selenium<\/a> WebDriver. Using Selenium, I grab a base64-encoded screenshot of the website and convert it into a <a href=\"https:\/\/github.com\/python-imaging\/Pillow\">Python Imaging Library<\/a> (PIL) image. The PIL image is converted to RGB and resized to QVGA (320&#215;240) to speed up the color computations.<\/p>\n<p>To compute the representative colors of each webpage, I use <a href=\"https:\/\/en.wikipedia.org\/wiki\/K-means_clustering\">k-means clustering<\/a> with (sort-of) <a href=\"https:\/\/en.wikipedia.org\/wiki\/Expectation%E2%80%93maximization_algorithm\">expectation maximization<\/a> to pick the optimal number of clusters. I usually cap the maximum number of clusters at six, though sometimes I allow up to ten. I rely heavily on <a href=\"http:\/\/docs.scipy.org\/doc\/scipy\/reference\/index.html\">SciPy<\/a> for most of the heavy lifting here. Each image is posterized with its representative colors, and the number of pixels of each color is counted. 
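A minimal sketch of that clustering step, assuming a fixed cluster count rather than the EM-style cluster selection described above (the function name is mine, not the script's):

```python
# Sketch: find representative colors of an RGB image with SciPy k-means.
# The real script also picks the cluster count adaptively; here k is fixed.
import numpy as np
from scipy.cluster.vq import kmeans, vq

def representative_colors(rgb, k=6):
    """rgb: (H, W, 3) array of pixel values. Returns (colors, counts)."""
    pixels = rgb.reshape(-1, 3).astype(float)
    centroids, _ = kmeans(pixels, k)       # cluster centers in RGB space
    labels, _ = vq(pixels, centroids)      # nearest centroid for each pixel
    counts = np.bincount(labels, minlength=len(centroids))
    return centroids.astype(int), counts
```

The counts are what get plotted later: each site contributes a handful of (color, pixel count) pairs.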
From there, the colors, plus some other metadata (load time, rank, etc.), are written to a JSON file for later visualization.<\/p>\n<p>In the script <code>webplot.py<\/code>, I parse the saved JSON document and plot the results via <a href=\"http:\/\/matplotlib.org\">Matplotlib<\/a>. In the topmost image, the top 100 websites are plotted from left to right. The top three sites in that figure are Google, Facebook, and YouTube. It&#8217;s easy to notice a bug when examining the colors for Google (note: this is the normal google.com, not a doodle). Notice how the three colors are light gray, dark gray, and white &#8211; not the typical red, green, blue, and yellow color scheme. Why? Well, when the screenshot is resized to 320&#215;240 pixels for processing, the colors are dithered. The number of pixels in the new image that lie <em>between<\/em> red, green, blue, yellow, and white &#8211; the dominant background color &#8211; is much larger than the number of pixels that are colored. Because of dithering, those in-between pixels are closer to shades of gray than to saturated colors, and thus the k-means clustering (with EM) finds shades of gray and white to be the &#8220;color of Google&#8221;. I&#8217;m not sure if this is a bug&#8230; what do you think?<\/p>\n<h2>Grab the Source<\/h2>\n<p>As always, you can check out my source:<\/p>\n<pre lang=\"bash\" line=\"0\">git clone git:\/\/git.bardagjy.com\/webcolors<\/pre>\n<p>I&#8217;m fairly pleased with the code, but there are a few things I&#8217;d like to make a bit cleaner. First of all, I&#8217;d like to accept parameters on the command line rather than hard-coding them at the top of the file.<\/p>\n<p>When I first designed it, the <code>urlimg<\/code> method simply took in a URL and, optionally, a screenshot resolution, and returned a base64-encoded screenshot. For performance reasons, I now open one browser in the main method, then pass the browser object to the method that screenshots each page. This makes the <code>urlimg<\/code> method a bit less modular. 
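Roughly, the current shape of that method is something like the following (a reconstruction under my own assumptions, not the actual source):

```python
# Sketch of a urlimg-style helper: screenshot a page with a shared Selenium
# browser object and return it as a PIL image, resized for color analysis.
import base64
import io

from PIL import Image

def urlimg(url, browser, size=(320, 240)):
    """Screenshot `url` using an already-open Selenium browser object."""
    browser.get(url)
    png_b64 = browser.get_screenshot_as_base64()
    img = Image.open(io.BytesIO(base64.b64decode(png_b64)))
    return img.convert("RGB").resize(size)
```

Since any object exposing get() and get_screenshot_as_base64() works here, the helper is easy to exercise without launching a real browser.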
In the future, I&#8217;ll make the browser parameter optional: if a browser object is not passed in, the method will create one, take the screenshot, and close the browser; if a browser object is passed in, the function will use it to produce the screenshot.<\/p>\n<p>Another issue is how I store the colors and other metadata of each webpage. Though I would normally put all of this in a database, I decided to store it as one big JSON object because I&#8217;ve been messing with JSON at work. JSON appears to be a fine choice for this type of data, but instead of incrementally writing to the file (like I should), I store everything in a big dictionary which I dump to a JSON file when everything has finished. The good news is that 1000 sites come to only around 250KB, so 1M sites should be around 250MB &#8211; not great, but small enough to keep in memory.<\/p>\n<p>The last fudge is how I deal with slow-loading websites. During testing, I found that some websites took longer than 30,000 ms to load. This causes the Chrome driver to time out, throwing an exception. My hypothesis was that the slow load times were a result of temporary network or computer issues (I often ran the script in a small virtual machine). To resolve this, I simply caught the exception in a try \/ except statement and tried to load the page again. If a page threw a timeout exception five times, its URL was skipped. During testing on the top 1000 URLs, this fix entirely resolved the problem: no URLs were skipped.<\/p>\n<p>Finally, because the code depends on so many external packages, maybe in the future I&#8217;ll package it up so it can be installed via <code>pip<\/code>.<\/p>\n<h2>Popups and Porn<\/h2>\n<p>I learned a lot about the internet while running this experiment &#8211; things that I hope to turn into other visualizations and studies. 
Five or six of the sites in the top 100 are <strong>porn<\/strong> &#8211; this makes running the scraper in the background at work a bit awkward. Probably just shy of half of the top 100 sites are in English (I expected fewer). A shocking number of sites in the top 100 <strong>autoplay music or videos<\/strong> when they load &#8211; I thought that was a thing of the past! Another surprise is the number of <strong>spammy popups<\/strong> generated by the top 100 that are able to defeat Chromium&#8217;s <code>better-popup-blocking<\/code>. To make running this a bit less obtrusive, I run it inside an Ubuntu virtual machine!<\/p>\n<h2>Top 1000<\/h2>\n<div id=\"attachment_1714\" style=\"width: 620px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/webcolors1000.pdf\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1714\" src=\"http:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/webcolors1000-610x482.png\" alt=\"Colors of the top 1000 websites on the internet.\" width=\"610\" height=\"482\" class=\"size-large wp-image-1714\" srcset=\"https:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/webcolors1000-610x482.png 610w, https:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/webcolors1000-300x237.png 300w, https:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/webcolors1000-299x236.png 299w, https:\/\/bardagjy.com\/wp-content\/uploads\/2013\/08\/webcolors1000.png 1011w\" sizes=\"(max-width: 610px) 100vw, 610px\" \/><\/a><p id=\"caption-attachment-1714\" class=\"wp-caption-text\">Andy Bardagjy, Constituent colors of top 1000 websites on the internet. August 6, 2013. (Links to PDF.)<\/p><\/div>\n<p>Finally, a visualization of the colors of the top 1000 websites. It&#8217;s very weird that there seems to be a &#8220;noise floor&#8221;: a level where many sites have the same number of pixels &#8211; though not the same colors. 
To me, this indicates either a bug in my algorithms or maybe a common design aesthetic. Does the distribution follow the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Golden_ratio\">golden ratio<\/a> or a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Power_law\">power law<\/a>?<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Allison and I had been thinking about decomposing scenes, art, and geometries into representative colors, textures, and features. Then, during an inspiring walk around the MOMA, I spotted a woodblock print (below) by Sherrie Levine. In her prints Meltdown, she decomposes paintings by Duchamp, Kirchner, Mondrian, and Monet into their constituent colors. Can you guess [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1653,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[9,8,4],"tags":[135,136,134,137,141,133,104,132],"_links":{"self":[{"href":"https:\/\/bardagjy.com\/index.php?rest_route=\/wp\/v2\/posts\/1639"}],"collection":[{"href":"https:\/\/bardagjy.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/bardagjy.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/bardagjy.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/bardagjy.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1639"}],"version-history":[{"count":38,"href":"https:\/\/bardagjy.com\/index.php?rest_route=\/wp\/v2\/posts\/1639\/revisions"}],"predecessor-version":[{"id":1718,"href":"https:\/\/bardagjy.com\/index.php?rest_route=\/wp\/v2\/posts\/1639\/revisions\/1718"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/bardagjy.com\/index.php?rest_route=\/wp\/v2\/media\/1653"}],"wp:attachment":[{"href":"https:\/\/bardagjy.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1639"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"h
ttps:\/\/bardagjy.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1639"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/bardagjy.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1639"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}