How to Extract Tables from HTML to JSON or CSV

Understanding HTML Table Structure

HTML tables are one of the most common ways data is presented on the web. From financial reports and sports statistics to product comparison charts and government datasets, tables remain the standard for displaying structured, row-and-column data in a browser. Understanding the anatomy of an HTML table is the first step to extracting data from one.

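A basic HTML table consists of a <table> element that typically contains <thead> (header), <tbody> (body), and optionally <tfoot> (footer) sections. All three sections are optional in the markup, and simpler tables place their rows directly inside <table>. Each row is a <tr> element, and each cell is either a <th> (header cell) or <td> (data cell):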

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Price</th>
      <th>Stock</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Widget A</td>
      <td>$9.99</td>
      <td>150</td>
    </tr>
    <tr>
      <td>Widget B</td>
      <td>$14.99</td>
      <td>89</td>
    </tr>
  </tbody>
</table>

Why Scrape HTML Tables?

Many websites present valuable data in tables but do not offer an API or downloadable file. Researchers tracking economic indicators, analysts monitoring competitor pricing, and journalists compiling public records often need to extract table data from web pages into a structured format. Manually copying and pasting table data is tedious and error-prone, especially when tables span hundreds of rows.

Automated table extraction saves hours of manual work and reduces the risk of human error. By converting an HTML table to JSON or CSV, you can immediately import the data into spreadsheets, databases, or analytics tools for further processing.

JSON vs CSV Output: Which to Choose

When extracting table data, you typically have a choice between JSON and CSV output formats. Each has its strengths depending on your use case:

JSON is ideal if you plan to use the data in a web application, API, or NoSQL database. Each row becomes an object with named keys, making it self-describing and easy to work with in JavaScript or Python.
CSV is the better choice if you want to open the data in Excel, Google Sheets, or any spreadsheet application. It is also a common input format for data science workflows using pandas or R.
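
To make the difference concrete, here is a minimal sketch that renders the same rows in both formats. The sample rows mirror the Widget table shown earlier; the helper names escapeCsvField and toCsv are hypothetical, and the quoting rules follow common CSV conventions (RFC 4180 style) rather than any particular tool's output.

```javascript
// Sample rows matching the Widget table from the structure example.
const rows = [
  { Name: "Widget A", Price: "$9.99", Stock: "150" },
  { Name: "Widget B", Price: "$14.99", Stock: "89" },
];

// Quote a field only when it contains a comma, double quote, or line break.
// Embedded double quotes are doubled, as in RFC 4180.
function escapeCsvField(value) {
  const text = String(value ?? "");
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Convert an array of row objects to CSV text.
// Assumes every row has the same keys as the first row.
function toCsv(rowObjects) {
  if (rowObjects.length === 0) return "";
  const headers = Object.keys(rowObjects[0]);
  const lines = [headers, ...rowObjects.map((row) => headers.map((h) => row[h]))];
  return lines.map((fields) => fields.map(escapeCsvField).join(",")).join("\r\n");
}

const asJson = JSON.stringify(rows, null, 2); // self-describing: keys repeat per row
const asCsv = toCsv(rows);                    // compact: keys appear once, in the header
```

The JSON output repeats the column names in every object, which is what makes it self-describing. The CSV output states them once in the header line, which keeps files small and spreadsheet-friendly.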

PulpMiner's HTML Table Extractor supports both output formats, so you can choose the one that fits your workflow best.

How to Use the PulpMiner HTML Table Extractor

To extract tables from HTML, paste the HTML source code containing one or more tables into the input editor. The tool will automatically detect all <table> elements and display them in a list. If the HTML contains multiple tables, you can select which one to extract.

The extractor reads the header row to create column names and then iterates through each body row to build the dataset. You can switch between JSON and CSV output with a single click. The resulting data is ready to copy or download immediately.
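
The header-then-rows approach described above can be sketched on plain arrays. This is an illustration, not PulpMiner's actual implementation: the function names are hypothetical, and the handling of blank or repeated header names (placeholder names and numeric suffixes) is an assumption about one reasonable way to keep column names unique.

```javascript
// Build unique column names from the header cells.
// Assumption: blank headers become "column_N" and repeats get a numeric suffix.
function makeColumnNames(headerCells) {
  const seen = new Map();
  return headerCells.map((raw, index) => {
    const base = raw.trim() || `column_${index + 1}`;
    const count = (seen.get(base) ?? 0) + 1;
    seen.set(base, count);
    return count === 1 ? base : `${base}_${count}`;
  });
}

// Pair each body row with the column names.
// Rows with fewer cells than headers are padded with empty strings.
function rowsToObjects(headerCells, bodyRows) {
  const names = makeColumnNames(headerCells);
  return bodyRows.map((cells) =>
    Object.fromEntries(names.map((name, i) => [name, cells[i] ?? ""]))
  );
}

const records = rowsToObjects(
  ["Name", "Price", "Price", ""],
  [["Widget A", "$9.99", "$8.99"]]
);
```

Without unique names, the second "Price" column would silently overwrite the first when rows are turned into objects, so it is worth deciding on a naming rule before converting to JSON.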

Handling Complex Tables

Real-world HTML tables are not always clean. Some common complications include merged cells using colspan and rowspan attributes, nested tables within cells, and tables that use <div> elements styled to look like tables instead of semantic HTML table elements.

The PulpMiner extractor handles colspan by distributing the cell value across multiple columns and rowspan by carrying the value down to subsequent rows. For div-based "tables" that do not use semantic HTML, you may need to inspect the page source and identify the correct CSS selectors to target the data rows and cells.
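
One way to read the colspan and rowspan handling described above is that the cell text is repeated in every grid position the cell covers. The sketch below implements that reading on plain data. The input shape ({ text, colspan, rowspan }) and the function name expandSpans are hypothetical, and the tool's real behavior may differ.

```javascript
// Expand colspan/rowspan into a rectangular grid by repeating the cell text.
// Each input cell is { text, colspan?, rowspan? }, a hypothetical stand-in
// for values read from a parsed table.
function expandSpans(rows) {
  const grid = rows.map(() => []);
  rows.forEach((cells, r) => {
    let col = 0;
    for (const { text, colspan = 1, rowspan = 1 } of cells) {
      // Skip positions already filled by a rowspan from an earlier row.
      while (grid[r][col] !== undefined) col++;
      // Copy the text into every position the cell covers,
      // clipping rowspans that run past the last row.
      for (let dr = 0; dr < rowspan && r + dr < rows.length; dr++) {
        for (let dc = 0; dc < colspan; dc++) {
          grid[r + dr][col + dc] = text;
        }
      }
      col += colspan;
    }
  });
  return grid;
}

const grid = expandSpans([
  [{ text: "Region", rowspan: 2 }, { text: "Sales", colspan: 2 }],
  [{ text: "Q1" }, { text: "Q2" }],
  [{ text: "North" }, { text: "10" }, { text: "12" }],
]);
```

After expansion every row has the same number of columns, so the grid can be passed to the usual header-and-rows conversion without cells shifting into the wrong column.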

Data Cleaning Tips After Extraction

After extracting table data, you will often need to clean it up before it is usable. Here are some common data cleaning steps:

Remove extra whitespace: HTML tables often contain leading and trailing spaces, non-breaking spaces (&nbsp;), and newline characters within cells.
Strip HTML tags: Some cells contain links, images, or formatting tags. Extract only the text content.
Normalize numbers: Currency symbols, commas as thousand separators, and percentage signs should be removed or standardized if you need numeric values.
Handle missing values: Empty cells should be represented consistently — as empty strings, null, or a placeholder value depending on your downstream system.
Standardize dates: Tables often display dates in locale-specific formats. Convert them to ISO 8601 format for consistency.
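
The cleaning steps above can be sketched as a handful of small helpers. All function names here are hypothetical, and two assumptions are baked in: numbers use "," as the thousands separator and "." as the decimal point, and dates arrive in US-style MM/DD/YYYY form. Other locales need different rules.

```javascript
// Collapse runs of whitespace into single spaces and trim the ends.
// In JavaScript, \s also matches non-breaking spaces (U+00A0) and newlines.
function cleanText(value) {
  return value.replace(/\s+/g, " ").trim();
}

// Remove tags with a simple regex. This is a rough approach for small cell
// fragments; a real HTML parser is more reliable for arbitrary markup.
function stripTags(value) {
  return value.replace(/<[^>]*>/g, "");
}

// Parse values such as "$1,234.50" or "12%" into a number.
// Returns null when nothing numeric remains.
function parseNumber(value) {
  const cleaned = value.replace(/[^0-9.\-]/g, "");
  if (cleaned === "") return null;
  const parsed = Number(cleaned);
  return Number.isNaN(parsed) ? null : parsed;
}

// Represent empty cells consistently as null.
function emptyToNull(value) {
  return value === "" ? null : value;
}

// Convert "MM/DD/YYYY" to ISO 8601 "YYYY-MM-DD".
// Only the shape is checked; month and day ranges are not validated.
function usDateToIso(value) {
  const match = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(value.trim());
  if (!match) return null;
  const [, month, day, year] = match;
  return `${year}-${month.padStart(2, "0")}-${day.padStart(2, "0")}`;
}

const name = cleanText(stripTags("  <a href='#'>Widget\u00a0A</a>\n "));
const price = parseNumber("$1,234.50");
const restocked = usDateToIso("3/7/2024");
```

Keeping each step in its own function makes it easy to apply only the ones a given column needs, for example parseNumber on a price column and usDateToIso on a date column.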

Extracting Tables Programmatically

If you need to extract tables from HTML in your code, here is an example using JavaScript and the DOM API:

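function extractTable(html) {
  // DOMParser is a browser API; in Node.js a DOM library is needed to provide it.
  const parser = new DOMParser();
  const doc = parser.parseFromString(html, "text/html");
  const table = doc.querySelector("table"); // first table in the document
  if (!table) return [];

  let headerCells = Array.from(table.querySelectorAll("thead th"));
  let bodyRows = Array.from(table.querySelectorAll("tbody tr"));

  // Without a <thead>, treat the first row as the header row.
  // (The HTML parser wraps bare <tr> rows in an implicit <tbody>.)
  if (headerCells.length === 0 && bodyRows.length > 0) {
    headerCells = Array.from(bodyRows[0].cells);
    bodyRows = bodyRows.slice(1);
  }

  const headers = headerCells.map((cell) => cell.textContent.trim());

  return bodyRows.map((row) => {
    // row.cells holds the row's own <td> and <th> cells, in order.
    const cells = Array.from(row.cells);
    return headers.reduce((obj, header, i) => {
      // Rows with fewer cells than headers get empty strings.
      obj[header] = cells[i]?.textContent.trim() ?? "";
      return obj;
    }, {});
  });
}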

For one-off extractions during development or research, the PulpMiner HTML Table Extractor is much faster than writing custom parsing code. Simply paste the HTML, select your table, and export the data.
