Creating the data files

The data itself must be stored in one or more CSV files. CSV, which stands for "comma-separated values", is an extremely simple data format that's supported by a large number of applications. In the case of Miga, the top row of each CSV file must be a header row that holds the name of each column.
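To make the header-row requirement concrete, here is a minimal sketch of a PHP script that writes a small CSV file whose first row holds the column names. The file name, column names and rows are hypothetical, chosen only for illustration; the script is not part of Miga.

```php
<?php
// Hypothetical column names and rows, for illustration only.
$header = ['Name', 'Color', 'Origin'];
$rows = [
    ['Apple', 'Red', 'Kazakhstan'],
    ['Banana', 'Yellow', 'Southeast Asia'],
];

$path = 'Fruits.csv'; // hypothetical output file name

$handle = fopen($path, 'w');
if ($handle === false) {
    fwrite(STDERR, "Could not open $path for writing\n");
    exit(1);
}

// The header row is written first; the data rows follow it.
// Delimiter, enclosure and escape are passed explicitly; the empty
// escape string requires PHP 7.4 or later.
fputcsv($handle, $header, ',', '"', '');
foreach ($rows as $row) {
    fputcsv($handle, $row, ',', '"', '');
}

fclose($handle);
```

Using fputcsv() rather than joining values with commas by hand means that values containing commas or quotation marks are quoted automatically.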

There are various ways to generate a CSV file. If your data isn't already stored in some structured way, you can create one manually, for example in a spreadsheet application or a text editor.

Using the importers

There are three importer scripts, each of which connects to an outside system (at the moment, all three connect to different types of MediaWiki wikis) and creates a CSV file based on its data. All three are PHP scripts. To use one, you create a separate file that describes the relevant aspects of the outside system, and then pass that file as an argument to the script.

This separate file, known as the "import settings" file, must also be written in PHP. It needs to define the variables for the import, and it can optionally include custom handling code for specific fields, although ideally that is not necessary.

You would call an importer in the following way:

php [importer-name] [import-settings-file]

An example would be:

php ../../importers/MediaWikiImporter.php FruitsImportSettings.php

(This assumes that the importer script is being called from within the directory for this app, i.e. /apps/app-name.)

There should be one import settings file per CSV file; thus, if an app has more than one CSV file, it will have more than one import settings file. In fact, it's possible for an app's CSV files to come from different systems, although in practice this may be unlikely.
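When an app has several import settings files, the importer calls can be scripted. The following is only a sketch: the second settings file name is hypothetical, and the location of SemanticMediaWikiImporter.php is assumed to match that of MediaWikiImporter.php in the example above. It prints one command per settings file and leaves the actual call commented out.

```php
<?php
// Builds the command line for one importer run; both arguments are shell-quoted.
function buildImportCommand(string $importerPath, string $settingsFile): string
{
    return 'php ' . escapeshellarg($importerPath) . ' ' . escapeshellarg($settingsFile);
}

// One settings file per CSV file, each paired with the importer it needs.
// "GrowersImportSettings.php" is a hypothetical name, and the path of the
// second importer is an assumption.
$imports = [
    'FruitsImportSettings.php' => '../../importers/MediaWikiImporter.php',
    'GrowersImportSettings.php' => '../../importers/SemanticMediaWikiImporter.php',
];

foreach ($imports as $settingsFile => $importerPath) {
    $command = buildImportCommand($importerPath, $settingsFile);
    echo $command, PHP_EOL;
    // Uncomment to actually run the importer:
    // passthru($command);
}
```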

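Some variables are common across all importers. In the examples below, every import settings file sets $gImportFileName (the name of the CSV file to create) and $gImportFields (an array that maps each CSV column name to the corresponding field in the outside system).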

MediaWikiImporter.php

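The MediaWiki importer gets all its data from infobox templates within the pages of a MediaWiki-based wiki - Wikipedia or otherwise. There are three ways to specify the set of pages whose data will be extracted: by category ($gImportCategoryName), by the template the pages call ($gImportTemplateName), or as an explicit list of pages ($gImportPageList).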

The variable $gImportAPIURL holds the API URL of the wiki being accessed - this is where the request is made to get the list of pages in a category, or the list of pages that call a certain template.
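As an illustration of the kind of request involved, the sketch below builds MediaWiki API URLs for the two listings mentioned above, using the standard "categorymembers" and "embeddedin" list modules. The function name is hypothetical, and whether the importer builds its requests in exactly this way is an assumption.

```php
<?php
// Builds a MediaWiki API URL that lists the pages in a category
// (list=categorymembers) or the pages that call a template (list=embeddedin).
// Hypothetical helper; not taken from the importer's source.
function buildPageListUrl(string $apiUrl, string $kind, string $name): string
{
    if ($kind === 'category') {
        $listParams = [
            'list' => 'categorymembers',
            'cmtitle' => 'Category:' . $name,
            'cmlimit' => 'max',
        ];
    } elseif ($kind === 'template') {
        $listParams = [
            'list' => 'embeddedin',
            'eititle' => 'Template:' . $name,
            'eilimit' => 'max',
        ];
    } else {
        throw new InvalidArgumentException("Unknown kind: $kind");
    }

    $params = ['action' => 'query', 'format' => 'json'] + $listParams;

    return $apiUrl . '?' . http_build_query($params);
}

$apiUrl = 'https://en.wikipedia.org/w/api.php';
echo buildPageListUrl($apiUrl, 'category', 'Sports cars'), PHP_EOL;
echo buildPageListUrl($apiUrl, 'template', 'Infobox automobile'), PHP_EOL;
```

The API returns results in batches, so a real client would also need to follow the "continue" values in each response to retrieve the full list.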

The variable $gImportWikiSubstring holds the substring prepended to page names in order to generate the URL for that page, if there's a "_url" field specified.
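The prepending described above amounts to simple string concatenation. The sketch below assumes that spaces in the page name are replaced with underscores, as in ordinary MediaWiki page URLs; the function name is hypothetical, and the importer's own handling may differ.

```php
<?php
// Generates a page URL by prepending the wiki substring to the page name.
// Assumption: spaces become underscores, as in ordinary MediaWiki page URLs.
function buildPageUrl(string $wikiSubstring, string $pageName): string
{
    return $wikiSubstring . str_replace(' ', '_', $pageName);
}

echo buildPageUrl('https://en.wikipedia.org/wiki/', 'Alfa Romeo 145'), PHP_EOL;
```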

Below is the import settings file to create a CSV file for infobox information about sports cars in Wikipedia. Note the commented-out setting of $gImportPageList - this variable can be quite helpful in debugging.

<?php

$gImportFileName = "Cars.csv";
$gImportAPIURL = "https://en.wikipedia.org/w/api.php";
$gImportWikiSubstring = "https://en.wikipedia.org/wiki/";
$gImportCategoryName = "Sports cars";
//$gImportPageList[] = "Alfa Romeo 145";
$gImportTemplateName = "Infobox automobile";
$gImportFields = array(
        'Name' => 'name',
        'Manufacturer' => 'manufacturer',
        'Production' => 'production',
        'Class' => array( 'class', 'lowercase' => true ),
        'Length, in mm' => array( 'length', 'units' => 'mm' ),
        'Image' => '_thumbnail',
        'URL' => '_url'
);

?>

SemanticMediaWikiImporter.php

The Semantic MediaWiki importer gets data from a wiki that uses Semantic MediaWiki to store its data. It is quite a bit simpler than the MediaWiki importer, because its data can be extracted via a programmatic query rather than by web scraping. The $gImportCategoryName and $gImportSpecialAskURL variables are required.

Here is an example of an import settings file, for the SMW-based wiki Food Finds:

<?php

$gImportFileName = "Food_finds.csv";
$gImportSpecialAskURL = "http://foodfinds.referata.com/wiki/Special:Ask";
$gImportCategoryName = "Category:All Good Eats";
$gImportFields = array(
        'Name' => '_name',
        'Cuisine' => 'Cuisine',
        'Address' => 'Location',
        'City' => 'City',
        'Country' => 'Country',
        'Price range' => 'Price Range',
        'Comments' => 'Comments',
        'Coordinates' => 'Coordinates'
);

?>
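To give a sense of what such a programmatic query looks like, the sketch below turns a category name and a field array like the ones above into a Semantic MediaWiki "ask" query string: a category condition followed by one printout per property. This is only an illustration. The function name is hypothetical, the way the importer actually builds its query is an assumption, and values beginning with an underscore (such as '_name') are assumed to be special values rather than property names.

```php
<?php
// Builds an SMW ask query: a category condition plus one "?Property"
// printout for each ordinary field. Hypothetical helper for illustration.
function buildAskQuery(string $categoryName, array $fields): string
{
    $parts = ['[[' . $categoryName . ']]'];

    foreach ($fields as $property) {
        // Assumption: values starting with "_" (e.g. "_name") are special
        // values handled separately, not SMW property names.
        if (strpos($property, '_') === 0) {
            continue;
        }
        $parts[] = '?' . $property;
    }

    return implode('|', $parts);
}

$fields = [
    'Name' => '_name',
    'Cuisine' => 'Cuisine',
    'Address' => 'Location',
];

echo buildAskQuery('Category:All Good Eats', $fields), PHP_EOL;
```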

There are two settings within Semantic MediaWiki that limit the number of rows that can be obtained for any one category. If you have administrative access to the SMW installation, you may need to increase both of them in order to get the full set of results you need.

WikidataImporter.php

WikidataImporter.php is the simplest of the importers to use, because it deals with only one site: wikidata.org.

Because the ability to query groups of pages in Wikidata is at the moment almost nonexistent, it may be necessary to query Wikipedia in order to get the list of pages - as with the MediaWiki importer, you can use the variables $gImportCategoryName, $gImportTemplateName or $gImportPageList to set the list of pages to be extracted from Wikidata.
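Once the list of page names has been obtained from Wikipedia, each page has to be matched to its Wikidata item. One way to do that is Wikidata's "wbgetentities" API module, which accepts a site ID and a page title. The sketch below only builds such a URL; the function name is hypothetical, and it is an assumption that the importer uses this particular call.

```php
<?php
// Builds a Wikidata API URL that looks up the item linked to a page on
// another wiki (by default English Wikipedia, site ID "enwiki").
// Hypothetical helper; not taken from the importer's source.
function buildWikidataEntityUrl(string $pageTitle, string $site = 'enwiki'): string
{
    $params = [
        'action' => 'wbgetentities',
        'sites' => $site,
        'titles' => $pageTitle,
        'props' => 'labels|claims',
        'format' => 'json',
    ];

    return 'https://www.wikidata.org/w/api.php?' . http_build_query($params);
}

echo buildWikidataEntityUrl('France'), PHP_EOL;
```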

Here is an import settings file used to get data about countries from Wikidata:

<?php

$gImportFileName = "Countries.csv";
$gImportAPIURL = "https://en.wikipedia.org/w/api.php";
$gImportTemplateName = "Infobox country";
$gImportFields = array(
        'Name' => '_name',
        'Continent' => 'continent',
        'Capital' => 'capital',
        'Language' => 'official language',
        'Form of government' => 'basic form of government',
        'Currency' => 'currency',
        'Borders' => 'shares border with',
        'URL' => '_url'
);

?>