April 12th, 2013

Diving in to Data with SPJ

Today, I had a chance to speak at the Society of Professional Journalists’ Region 1 Spring conference, held at Rutgers University in New Brunswick. Debbie Galant of the NJ News Commons and I talked about the projects that came out of our Hack Jersey hackathon. Then I laid out a map for building data skills in the newsroom.

For those who attended (and those who didn’t), here are links to some of the tools we discussed and tutorials to start to learn them.

The slides: bit.ly/R1C13DATA

We started by sharing examples of data-driven news applications on the web:

So how do you do this kind of work (or get to Carnegie Hall)? Practice. Practice. Practice.

A brief detour on the history of data journalism included a data piece in the first issue of the Manchester Guardian in 1821 and the cover of the program for IRE’s first computer-assisted reporting conference in 1993. And now, to today…

The disciplines of data reporting

1. Collection

2. Cleaning

3. Analysis

  • Excel is going to be your favorite tool ever. There are a number of good tutorials on basic Excel and more advanced Pivot Tables listed here, as well as a link to a great, free online course introducing databases.
  • And don’t forget the formula for percent change = (new-old)/old
  • For a few more days only, you can get 50 percent off all Excel e-books from O’Reilly here: bit.ly/R1C13EXCEL
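If you'd rather see the percent change formula as code than as a spreadsheet cell, here is a minimal sketch in Python. The function name and the sample figures are mine, made up for illustration. It multiplies by 100 so the result reads as a percentage rather than a fraction.

```python
def percent_change(old, new):
    """Return the change from old to new as a percentage: (new - old) / old * 100."""
    if old == 0:
        # Dividing by zero is undefined, so say so instead of crashing obscurely.
        raise ValueError("percent change is undefined when the old value is 0")
    return (new - old) / old * 100


# Hypothetical figures: 80 incidents last year, 100 this year.
print(percent_change(80, 100))  # 25.0
```

The order matters: the old value is always the denominator. Swapping old and new is the most common way to get this wrong.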

4. Visualization

5. Interaction

Support groups

What are your favorite data techniques, tools and tutorials? Please share them with me. I’d love to check them out.

April 4th, 2013

What I learned organizing a hackathon

It all started in September, at the Online News Association’s conference in San Francisco. I crashed the #wjchat party and ended up meeting the infamous Debbie Galant, the force of nature behind Baristanet who was just starting the NJ News Commons. We talked about journalism in Jersey. I floated the idea of a news hackathon, and she was intrigued.

Fast forward through a hurricane, an election, months of planning and promotion, and 24 hours of hacking. I sat in a lecture hall at Montclair State University listening in awe to 11 teams of journalists and programmers pitch their projects to change news in New Jersey.

[Of course, I'm glossing over many logistical details here. I'll just say that we could not have pulled it off without the generosity and advice of our sponsors and partners, especially the News Commons and the university, Knight-Mozilla OpenNews, CartoDB, O'Reilly, The Star-Ledger, The Record, Patch, Echo and many others.]

I couldn’t believe we pulled it off. There’s no way I could have imagined in September how much I’d learn about building community, blending cultures and sowing the seeds for us to think about news and data in new ways. Among those lessons:

  • It was a serious miscalculation on my part to try to start at 9 a.m. on a Saturday. Neither journalists nor developers are early risers. As we waited for folks to drift in, we had to start late and rush some of the presentations.
  • One of the biggest challenges was trying to manage people’s expectations of what they’d get out of giving up a weekend to stare at a computer screen on a college campus. Some people wanted to learn to code. Others wanted nerds to help them with their next big project. A few, I imagine, wanted to network for a job. We wanted everyone to feel like she or he could be a part of this experience and could offer something, no matter previous experience or skill level. For some newsroom denizens whose only exposure to developers was their much maligned IT staff, this was a revolutionary idea. They have been trained to treat technologists like a deli counter: Order up whatever newfangled internet thing you need and wait for it to show up. My hope was at the end of the hackathon, our reporters and developers could start to imagine how they can work together. I think on that front we succeeded.
  • We may have overloaded the schedule with speakers, and our talks were geared too heavily toward journalists. Some of the participants would have liked more discussion of the process of creating news apps rather than the concepts of data journalism and news app thinking. Others wanted less yakking and more hacking.
  • It’s just as important for us to bring programmers to the news world as it is to introduce journalists to the development process. We succeeded on the latter, but we need to do more for the former. We could have used speakers from the tech world who would have lured more programmers.
  • I wish we had more diversity in our roster of speakers. I was thrilled to have Emily Bell as one of our judges. A majority of the folks on our planning committee were women, as were many of our participants. But I think we erred in not having more women presenters, giving a false impression of the state of news development and the programming world as a whole.
  • We just didn’t have enough developers. We needed to have more than one on each team, preferably someone with backend skills and someone else with frontend skills. Some of the teams had one developer and four journalists. That just didn’t work.
  • We found some teams didn’t need more than one or two journalists, even those with significant computer-assisted reporting skills. Some of the CAR veterans had a hard time translating their expertise in analyzing data to the idea of developing a reusable application.
  • To my surprise, many of the programmers were not really familiar with version control or git. Only Sunday morning, as the deadline loomed, did some teams ask about how to use Github, leaving me to run around and teach people. Some teams also didn’t have server space to host a demo site, something I assumed was a given. A suggestion from judge and mentor Jonathan Soo was to have us offer hosted server space for the teams and to require everyone to push an initial commit to Github within the first hour of hacking.
  • The teams with strong projects at the end either came with an idea or settled on one very quickly. We need a mechanism for teams to find members and for people to think about data and project ideas ahead of time. Some teams spent far too much time arguing over ideas or failed to really evaluate the data they were hoping to use. Suggestions were made for a pre-networking event a week before or a forum or email list for people to talk ahead of time.
  • I realized far too late that our website was good for displaying details on the event planning, but the blog was completely unusable. The commenting system didn’t work at all. Next time, we need to test that earlier or set up a functional blog at blog.hackjersey.com.
  • Of our 11 projects, three teams didn’t finish enough to present anything beyond what they learned from not finishing. Another four teams had very rough demos, but had proofs of concept to show. So four teams finished pretty much functional projects.
  • If you want to build a strong community, the best way to start is by recruiting a broad coalition of journalists, developers, designers, hackers, educators, nonprofits and bureaucrats to help plan the event. Thanks in large part to Debbie Galant, we had more than 20 people on our organizing committee, and their ideas and dedication made this hackathon work.

All in all, I considered our first Hack Jersey event a success. And that leads us to think about what’s next. We have a few ideas:

  • An ongoing sponsorship from the New Jersey News Commons and the School of Communication and Media at Montclair State University.
  • A hack day aimed at scraping a poorly structured public dataset that could be hosted for all to use freely.
  • Programming training, perhaps through a partnership with other groups who have already invented that wheel.
  • Basic data reporting training, which many working journalists say they are hungry for.
  • Solicit ideas for newsroom tools from news organizations large and small. Take that list of use cases and host a hack day or series of hack days to build open source tools.

What do you think Hack Jersey should work on next? What would be most useful to you as a journalist? If you’re a developer, what kind of projects would you be interested in working on with us? I’d love to hear your ideas. Please share them in the comments, or email us at hackinfo(at)hackjersey(dot)com.

March 1st, 2013

Learning to commit to version control

At this week’s computer-assisted reporting conference in Louisville, IRE has me doing double-duty. In addition to my class on OpenRefine, I’m also teaching a hands-on session on git and Github.

This assumes you’ve already downloaded git and created an account on Github. I use the command line, although if you master that, the GUI clients will be a piece of cake.

I’ve somehow inadvertently become a bit of a git evangelist, not out of any mastery, but mostly because of my convert’s zeal for version control. And the one-two punch of git and Github has changed the way I think about my job, web development and sharing knowledge. Here’s a quick run-through of the class. If you’re a visual learner, the slides are here: http://bit.ly/car13gitslides

What is git?

  • a distributed version control system (If this doesn’t mean anything to you, don’t sweat it. It’s mostly for the nerds)
  • a command line utility to track changes to a file and to share those changes with others.
  • good for any kind of text - stories, csv, html, .js, .py, .rb.
  • not so great with images, audio, video

What does git do?

In a broad oversimplification, it uses diff to compare every addition and subtraction in your code and shows you how your files evolve.

It lets you take snapshots of your code and roll through them over time (or back in time) as needed to follow how a file changes from save to save to save. And you can have several authors of a file branch off their own versions to edit and then later merge them all back together into one master file.
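To make the idea of a diff concrete, here is a small sketch using difflib from Python's standard library. This is not git itself, and git's own machinery is more involved. It only shows what "comparing every addition and subtraction" looks like for two versions of a text file. The two story versions and the file name are invented for the example.

```python
import difflib

# Two hypothetical snapshots of the same file, one line per list item.
old_version = [
    "The council approved the budget.",
    "The vote was 5-2.",
]
new_version = [
    "The council approved the budget Tuesday.",
    "The vote was 5-2.",
    "The mayor did not attend.",
]


def show_diff(old_lines, new_lines, name="story.txt"):
    """Return unified-diff lines: '-' marks removals, '+' marks additions."""
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile=f"a/{name}",
            tofile=f"b/{name}",
            lineterm="",
        )
    )


diff_lines = show_diff(old_version, new_version)
print("\n".join(diff_lines))
```

Lines starting with a minus sign were removed, lines starting with a plus sign were added, and unchanged lines are shown for context. That is the same convention you see in git's diff output.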

How does it work?

There are a lot of things going on behind the curtain in git that you can figure out eventually, but don’t worry about them here. For right now, you only really need to know six commands.

The first thing you want to know is

git status

This will orient you to where you are, what files have changed and whether you’ve saved your snapshot of your project (which we’ll soon start calling a “commit”). This is your friend.

Where you work and where you edit your code is the working directory. When you’re ready to take a snapshot of your code, you “add” your file to the staging area.

git add foo.py

After you’ve made all the changes you want, and you’ve “added” all of your updated files to the staging area, it’s time to make a commit. And with each commit, you want to add a short message describing what the changes are.

git commit -m "finally debugged my own idiocy. maybe..."

Now all of your updates are saved in your file repository, and your snapshot is made. You could very easily stop here and just keep the audit trail on your machine. But the other half of this happy marriage is Github.


What is Github?

Github is a social coding site, and it’s a hosted remote repository.

It lets you back up all of your code online (for free if it’s open source), and it lets you see how other developers do it and learn from their work.

Here’s how you do it:

git push -u origin master

You “push” your code from the “master” branch that’s on your machine to your remote repository on Github, which we call “origin.” This may have already been configured for you if you cloned your repo from Github. Otherwise, you would first have to add the remote yourself:

git remote add origin git@github.com:tommeagher/myrepo.git

One of the many great things about Github is that you can see the diff in your files without the Matrix hypnosis of command line.

The last couple of commands I’ll show you are how you get commits from the remote repository on Github back onto your machine.

You can use

git fetch origin

to gather the updated files from your remote “origin” repository. Now you want to “merge” the changes from the master branch on your remote “origin” repository into your master branch on your machine.

git merge origin/master

Now your files are updated from the remote repository. Now keep coding.

If you want more, you can visit the repo for this class, where I have a cheat sheet for the most common basic uses and commands and more in-depth tutorials.

This really just deals with the basics. We haven’t even had a chance to talk about merge problems, branching, forking or cloning, but you know how to Google, so you can figure it out. Fork my repo, improve the cheat sheet and send me a pull request.

What’s your favorite tip or cheat in git? Leave me a comment. I’d love to read it.

February 26th, 2013

More tips for using OpenRefine

At IRE’s computer-assisted reporting conference this week in Louisville, I am once again teaching a course on using OpenRefine to clean data. Although the program has changed names over the last few months, its features are pretty much the same, and it’s still an amazingly powerful and free tool for cleaning and standardizing difficult data. If you ever find yourself frustrated with typos, misspellings or any number of other mistakes in the data you get from government agencies, OpenRefine will change your life.

I won’t repeat verbatim my Refine tutorial, which you can find here, but I will share some of the resources again and point you to a few other tutorials.

First, check out the updated slides for my class. If you’d like to follow along with the class, you can download our data for the lottery winners, the hospital report cards, and the campaign finance data.

If you’re just looking for a good cheat sheet of Refine’s key functions, try this.

If you thought my class was easy, and you’re looking for more help with Refine, I can recommend these tutorials, from Refine creator David Huynh, developer Dan Nguyen and journalism educator Paul Bradshaw.

What’s your favorite use for OpenRefine? Leave a comment here, drop me an email or mention me on Twitter. I’d love to hear what you think.

December 11th, 2012

Talking data in the Nutmeg State

Update: In mid-December, I led two days of training in data journalism techniques for my colleagues at the New Haven Register. On a Thursday evening, I took the Amtrak train home to New Jersey, feeling good about turning a group of journalists on to the power and fun of incorporating data into their reporting.

The next morning, the horrible shooting at Sandy Hook Elementary School in Newtown threw our world upside-down. Only now, more than two months later, have I been able to take a moment to revisit the rough notes and links I dumped here in December to try to make them a little more useful.

The first class was an overview discussion of “data journalism” and the many and sundry techniques that encompasses. The second class was a hands-on walk-through of Excel and Google Fusion Tables. The handouts and cheatsheets cover nearly all of the tips we discussed. If you have questions, or would like to know more, please leave a comment or send me an email.

Intro to data - slide deck

Handouts:

Extra reading


Hands-on Excel & Fusion Tables class

Data for Excel exercises

Let’s do something fun and map the data for a story, using Google’s experimental (and free) Fusion Tables program. We’re going to download a spreadsheet in the CSV format of the violent crime rates for every town in Connecticut.

Go to DataHaven.org. Go to indicators>Public safety>Community safety>Violent crimes>total violent crime rate.
Select all towns and hit the submit button. Click “All years” and hit submit again. Now you can download the CSV, which stands for comma-separated values. It’s essentially a spreadsheet without any of the fancy highlighting or formatting of Excel. It’s a universal format, and when given the option, it’s a good idea to get your data in CSV. Because of its simple structure, it’s easy for a human to read and just about any computer can understand it.
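To show just how simple the CSV structure is, here is a short sketch that reads one with Python's built-in csv module. The column names and rows are placeholders I made up. They are not DataHaven's actual headers or figures, so swap in whatever your download contains.

```python
import csv
import io

# Made-up sample standing in for the downloaded file. The headers and
# values are illustrative only, not DataHaven's real output.
sample_csv = """town,year,violent_crime_rate
Town A,2010,1.5
Town B,2010,0.4
"""

# DictReader uses the first line as column names and returns one dict per row.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

for row in rows:
    print(row["town"], row["year"], row["violent_crime_rate"])
```

Every value comes back as text, so convert the rate with float() before doing any math on it. To read the real file, replace the io.StringIO(...) part with open("download.csv", newline="").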

Now we have our data, but if we want to visualize it, we need a map to put it on. We can get this map from a SHP (pronounced “shape”) file. Luckily, in many states, the government or academic institutions have a clearinghouse for this kind of geographic data. In Connecticut, you can find it at the U. Conn. library. Point your browser toward http://magic.lib.uconn.edu/, then look under the “CT GIS” column and click on the “Boundaries” section.

Grab the “Connecticut Towns” shapefile from 2010. Then you have to convert the file to a format that Fusion Tables can use. Luckily, there is a free website called “Shp Escape” that can help. Give it your shp file and log into your Google account, and it will automatically transform the file and ship it into your Fusion Tables account. Sometimes the site can be a little busy, so it may take a few minutes. Be patient.

Now that you have your map file in Fusion Tables, you need the data you want to visualize. Go to Google Drive, click the “create” button and choose Fusion Table. Choose the “download.csv” file that you got from DataHaven and put this into your Fusion Table. (Notice that Fusion Tables also allows you to import tabular data directly from a Google Spreadsheet.)

Clean up the column headings in the file to make them more meaningful. Now we’re going to merge the data file back to the shape file we already uploaded. Under the “File” dropdown menu, choose “Merge.” Find your Connecticut Towns shape table. Then you have to determine which field to join the tables on. What field is the same in both tables? Notice that the “town” field in the crime table matches the “NAME10” field in the shape file. That’s your key to join the tables together. Just be certain that each town name only appears once in that column in each table (in this example, it does).
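Before you merge, it can help to check the join key outside of Fusion Tables. The sketch below is one way to do that in Python. The function name and the town lists are hypothetical stand-ins for the “town” column and the “NAME10” column. It flags names that appear more than once in either table and names that exist in only one of them.

```python
from collections import Counter


def check_join_keys(left_keys, right_keys):
    """Report duplicate keys in each table and keys missing from the other."""
    left_counts = Counter(key.strip() for key in left_keys)
    right_counts = Counter(key.strip() for key in right_keys)
    return {
        "duplicates_left": sorted(k for k, n in left_counts.items() if n > 1),
        "duplicates_right": sorted(k for k, n in right_counts.items() if n > 1),
        "only_in_left": sorted(set(left_counts) - set(right_counts)),
        "only_in_right": sorted(set(right_counts) - set(left_counts)),
    }


# Placeholder values, not real Connecticut data.
crime_towns = ["Town A", "Town B", "Town B", "Town C"]
shape_names = ["Town A", "Town B", "Town D"]

report = check_join_keys(crime_towns, shape_names)
print(report)
```

If all four lists come back empty, the two columns line up one-to-one and the merge should be clean. Anything else usually points to a repeated row, a spelling difference or stray whitespace worth fixing first.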

Once it sends you into Fusion Tables, click the “About this table” link under the File dropdown menu and “edit table information.” Update the name of the table, give it a meaningful description, attribution and a link to the original download. This can help you later to retrace your steps and bulletproof your reporting. It will also help others who look at your work and want to see your primary source material. Click on the “Share” button and make the map public.

We have both tables combined into one new, merged table. But if you click on the “Map” tab, it probably still doesn’t look right. We need to style the representation of the data. While you’re on the map tab, go to the “Tools” dropdown and click on “Change Map Styles.” This will allow us to choose how we want to color code the towns based on their crime rate. Instead of points, we want to click on “Fill color” under the “Polygons” section. Points would be the choice if we were color-coding pins on individual addresses. Instead, we want to shade an entire town based on its crime rate, thus we want “polygon.” For the fill color, let’s pick the “buckets” tab, which means we can choose four different colors to represent the range of crime rates in our area. If you want to get fancy, you can choose whatever color you want for each bucket. Hit the “save” button. Are all of your towns filled in with a color? (You can ignore the “Southwick Jog.”) If not, then you probably haven’t set the range of values for your crime rate to include all of them. Adjust the high and low ends of your range until none of your towns are blank. This may take some jiggering.
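If the jiggering gets tedious, you can work out a range that covers every town before you touch the style menu. This is a rough sketch in Python, not a description of how Fusion Tables picks its own defaults. The function name and the sample rates are invented. It splits the span from the lowest rate to the highest into four equal-width buckets.

```python
def equal_width_buckets(values, n_buckets=4):
    """Split the range from min(values) to max(values) into equal-width buckets."""
    low, high = min(values), max(values)
    width = (high - low) / n_buckets
    # Use the true maximum as the last edge so rounding never drops the top value.
    edges = [low + i * width for i in range(n_buckets)] + [high]
    return list(zip(edges[:-1], edges[1:]))


# Hypothetical crime rates for four towns.
rates = [0.0, 2.0, 5.0, 8.0]

buckets = equal_width_buckets(rates)
for start, end in buckets:
    print(f"{start} to {end}")
```

Because the first bucket starts at the lowest value and the last one ends at the highest, no town falls outside the range and shows up blank. Equal widths are only one choice. If a few towns have much higher rates than the rest, uneven buckets may tell the story better.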

Play with the bucket sizes and colors of the buckets. What makes sense? Is everything in there?

Click on the title to change the attribution and background info for the merged table. Share the merged table and make it public. Then click Tools> Publish. This gives you the embed code for you to put it in your story.

Once you get the hang of this, you can be producing web-ready interactive maps in 20-30 minutes, a task that would have taken hours or days not all that long ago. This is an awesome tool.

There are a few more things that you can tweak, if you like. You can go to Tools>Change info window layout, to tweak the information that’s displayed when you click on a polygon.

And one last tip: You don’t always have to upload your own data or map files. You can go to Help>Search Public Tables and browse through all of the data that other Fusion Tables users have uploaded. Of course, as with all reporting, the caveat is to be mindful of the source. Always confirm the information and data with other parties. But if you find a good shapefile, you could use it again and again. It’s definitely worth exploring.

That’s the super fast and dirty introduction to Fusion Tables. Now, go explore it some more on your own. What’s your favorite trick in Fusion Tables? Drop me an email, leave a comment or mention me on Twitter and I’ll add it here.

Tipsheets and tutorials