Creating the data files
The data itself must be stored in one or more CSV files. CSV, which stands for "comma-separated values", is an extremely simple data format that's supported by a large number of applications. In the case of Miga, the top row of each CSV file must be a header row that holds the names of the columns.
There are various ways to generate a CSV file. If your data isn't already stored in some structured way, you can simply create the file manually. Otherwise, the options include:
- If your data is in a spreadsheet, nearly all spreadsheet applications allow you to save the data as a CSV file. If you do that, make sure to save the file in "Unicode (UTF-8)" encoding - this will greatly increase the likelihood that the CSV file will be parsed correctly, and that all characters will display correctly on the screen.
- If the data is in a database, there are various ways to produce a query that exports the contents of any table to a CSV file; see here for one way to do it, using MySQL. You can also just run a normal SELECT call to display the set of data that you need as a table, then copy and paste that table into a spreadsheet application and save it to CSV that way.
- Miga comes with some built-in "importers" that help you extract data from specific applications. Owing to its roots as a MediaWiki spinoff, its three current importer scripts are all MediaWiki-based: there's one for regular MediaWiki wikis (including Wikipedia), one for wikis that use Semantic MediaWiki to store their data, and one for Wikidata, the (soon-to-be massive) data repository.
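If you generate the file from your own script instead, the only requirements described above are a header row and, ideally, UTF-8 encoding. The following is a minimal sketch of that in PHP; the file name, column names and rows are made up for illustration, and the writeCsvFile() function is hypothetical, not part of Miga.

```php
<?php
// Hypothetical sample data: the first row is the header row that Miga expects.
$rows = array(
    array( 'Name', 'Color', 'Origin' ),
    array( 'Açaí', 'Purple', 'Brazil' ),
    array( 'Lychee', 'Red, pink', 'China' ),
);

// Writes the rows to a CSV file and returns the number of rows written.
// The delimiter, enclosure and escape character are passed explicitly so
// that the output does not depend on PHP's defaults.
function writeCsvFile( $fileName, array $rows ) {
    $handle = fopen( $fileName, 'w' );
    if ( $handle === false ) {
        throw new RuntimeException( "Could not open $fileName for writing" );
    }
    $count = 0;
    foreach ( $rows as $row ) {
        fputcsv( $handle, $row, ',', '"', '\\' );
        $count++;
    }
    fclose( $handle );
    return $count;
}

// The file location is an assumption. PHP writes strings byte-for-byte,
// so the file is UTF-8 as long as this script itself is saved as UTF-8.
$csvFile = sys_get_temp_dir() . '/Fruits.csv';
$written = writeCsvFile( $csvFile, $rows );
```

Values that contain a comma, such as "Red, pink", are quoted automatically by fputcsv(), which is the main reason to prefer it over joining the values by hand.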
Using the importers
There are three importer scripts, each of which connects to an outside system (at the moment, all three connect to different types of MediaWiki wikis) and creates a CSV file based on its data. All three are PHP scripts. To use one, you create a separate file that describes the relevant aspects of the outside system, and then pass that file as an argument to the script.
This separate file, known as the "import settings" file, must also be in PHP. It needs to define variables for the import, and it can optionally include custom handling code for specific fields, although ideally that is not necessary.
You would call an importer in the following way:
php [importer-name] [import-settings-file]
An example would be:
php ../../importers/MediaWikiImporter.php FruitsImportSettings.php
(This assumes that the importer script is being called from within the directory for this app, i.e. /apps/app-name.)
There should be one import settings file per CSV file; thus, if an app has more than one CSV file, it can have multiple import settings files. In fact, it's possible for an app to have different files that come from different systems, although in practice this may be unlikely.
There are some variables that are common across all importers:
- $gImportFileName - sets the name of the CSV file that will be generated by this script. It's a good idea to set this name to something different from the real name of the file, so that you can check the data generated before replacing the file that's already being used (if any).
- $gImportFields - sets the fields that will be imported, and some explanatory information about each. This is an array whose format differs slightly between the different importer scripts. However, its structure is fairly standard: the "keys" are the column names that will go into the top row of the CSV file, while the "values" hold the information needed to extract and, in some cases, process the data for each cell. If the value is the simple text "_name", "_url" or "_thumbnail", that is a directive that the value should be the name, URL or designated thumbnail image, respectively, of the current web page. (This assumes that each row corresponds to a specific web URL, which is the case for the three existing importers.) If the value is any other simple string, that string is the name of the field to be extracted. If the value is an array, the first string in the array is the field name, while the remaining values are directives on how to process the text - 'lowercase' indicates that the value should be lowercased, and 'units' indicates that the value should be converted to a specific unit.
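The rules for interpreting a $gImportFields value can be sketched as follows. This is only an illustration of the description above, not the importers' actual code: the function name describeFieldValue() and the sample array are hypothetical.

```php
<?php
// Hypothetical sample, following the format described above.
$gImportFields = array(
    'Name' => '_name',
    'Maker' => 'manufacturer',
    'Class' => array( 'class', 'lowercase' => true ),
    'Length, in mm' => array( 'length', 'units' => 'mm' ),
);

// Returns a short description of how one value of $gImportFields
// would be interpreted, based on the rules in the text.
function describeFieldValue( $value ) {
    $pageDirectives = array( '_name', '_url', '_thumbnail' );
    if ( is_string( $value ) ) {
        if ( in_array( $value, $pageDirectives, true ) ) {
            return "page directive $value";
        }
        return "field '$value'";
    }
    // An array: the first element is the field name, the rest are directives.
    $fieldName = $value[0];
    $directives = array();
    foreach ( $value as $key => $setting ) {
        if ( $key === 0 ) {
            continue;
        }
        if ( is_int( $key ) ) {
            // Directive given as a plain value, e.g. array( 'class', 'lowercase' ).
            $directives[] = $setting;
        } elseif ( is_bool( $setting ) ) {
            // Directive given as a flag, e.g. 'lowercase' => true.
            $directives[] = $key;
        } else {
            // Directive with a parameter, e.g. 'units' => 'mm'.
            $directives[] = "$key=$setting";
        }
    }
    return "field '$fieldName' with " . implode( ', ', $directives );
}

// array_map() preserves the string keys when given a single array.
$descriptions = array_map( 'describeFieldValue', $gImportFields );
```

Here $descriptions keeps the column names as keys, so $descriptions['Class'] describes how the "Class" column would be filled.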
The MediaWiki importer gets all its data from infobox templates within the pages of a MediaWiki-based wiki - Wikipedia or otherwise. There are three ways to specify the set of pages whose data will be extracted:
- All pages in a category - use $gImportCategoryName for this. This will get data from every page in a category, or in any of its subcategories, that contains the specified infobox.
- All pages that use a specific infobox - use $gImportTemplateName for this. That variable has to be set in any case, so this is the simplest option.
- Manually specify the set of pages to be imported - use $gImportPageList for this. That variable holds an array of page names.
The variable $gImportAPIURL holds the API URL of the wiki being accessed - this is where the request is made to get the list of pages in a category, or the list of pages that call a certain template.
The variable $gImportWikiSubstring holds the substring prepended to page names in order to generate the URL for that page, if there's a "_url" field specified.
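As a rough illustration of how $gImportWikiSubstring is used, a page URL can be thought of as the substring followed by the page name. The sketch below is an assumption about the general approach, not the importer's own code: the getPageURL() function is hypothetical, and replacing spaces with underscores follows the usual MediaWiki URL convention.

```php
<?php
$gImportWikiSubstring = "https://en.wikipedia.org/wiki/";

// Hypothetical helper: builds the URL for a page from its name.
function getPageURL( $wikiSubstring, $pageName ) {
    // MediaWiki URLs use underscores in place of spaces.
    $urlName = str_replace( ' ', '_', $pageName );
    return $wikiSubstring . $urlName;
}

$url = getPageURL( $gImportWikiSubstring, "Alfa Romeo 145" );
// $url is now "https://en.wikipedia.org/wiki/Alfa_Romeo_145"
```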
Below is the import settings file to create a CSV file for infobox information about sports cars in Wikipedia. Note the commented-out setting of $gImportPageList - this variable can be quite helpful in debugging.
<?php
$gImportFileName = "Cars.csv";
$gImportAPIURL = "http://en.wikipedia.org/w/api.php";
$gImportWikiSubstring = "https://en.wikipedia.org/wiki/";
$gImportCategoryName = "Sports cars";
//$gImportPageList = array( "Alfa Romeo 145" );
$gImportTemplateName = "Infobox automobile";
$gImportFields = array(
    'Name' => 'name',
    'Manufacturer' => 'manufacturer',
    'Production' => 'production',
    'Class' => array( 'class', 'lowercase' => true ),
    'Length, in mm' => array( 'length', 'units' => 'mm' ),
    'Image' => '_thumbnail',
    'URL' => '_url'
);
?>
The Semantic MediaWiki importer gets data from a wiki that makes use of Semantic MediaWiki to store its data. It is quite a bit simpler than the MediaWiki importer, because its data can be extracted via a programmatic query, and not web scraping. The $gImportCategoryName and $gImportSpecialAskURL variables are required.
Here is an example of an import settings file, for the SMW-based wiki Food Finds:
<?php
$gImportFileName = "Food_finds.csv";
$gImportSpecialAskURL = "http://foodfinds.referata.com/wiki/Special:Ask";
$gImportCategoryName = "Category:All Good Eats";
$gImportFields = array(
    'Name' => '_name',
    'Cuisine' => 'Cuisine',
    'Address' => 'Location',
    'City' => 'City',
    'Country' => 'Country',
    'Price range' => 'Price Range',
    'Comments' => 'Comments',
    'Coordinates' => 'Coordinates'
);
?>
There are two settings within Semantic MediaWiki that limit the number of rows that can be obtained for any one category; if you have administrative access to the SMW installation, you may need to increase both of them in order to get the full set of results you need:
- Semantic MediaWiki limits the number of rows per category to 5,000, and this limit is unfortunately currently hardcoded in the SMW code. You can get around the limit by editing the SMW file "/includes/SMW_QueryProcessor.php", finding the value "5000" and changing it to some higher number.
- The variable $smwgQMaxLimit also limits the number of rows; by default it is set to 10000. If your desired number of rows is higher than that, you should increase the value of this variable in LocalSettings.php.
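For the second of these settings, the change is a single line in LocalSettings.php. The value 20000 below is only an example; use whatever exceeds the number of rows you expect. The placement after the line that enables Semantic MediaWiki is the usual convention for SMW settings, but check it against your own installation.

```php
// In LocalSettings.php, after the line that enables Semantic MediaWiki.
// 20000 is an example value, not a recommendation.
$smwgQMaxLimit = 20000;
```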
WikidataImporter.php is the simplest of the importers to deal with, because it can only deal with one site - wikidata.org.
Because the ability to query groups of pages in Wikidata is at the moment almost nonexistent, it may be necessary to query Wikipedia in order to get the list of pages - as with the MediaWiki importer, you can use the variables $gImportCategoryName, $gImportTemplateName or $gImportPageList to set the list of pages to be extracted from Wikidata.
Here is an import settings file used to get data about countries from Wikidata:
<?php
$gImportFileName = "Countries.csv";
$gImportAPIURL = "http://en.wikipedia.org/w/api.php";
$gImportTemplateName = "Infobox country";
$gImportFields = array(
    'Name' => '_name',
    'Continent' => 'continent',
    'Capital' => 'capital',
    'Language' => 'official language',
    'Form of government' => 'basic form of government',
    'Currency' => 'currency',
    'Borders' => 'shares border with',
    'URL' => '_url'
);
?>
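While testing, it may be easier to work with a short, fixed list of pages than with every page that uses a template. The variant below is a sketch of that, based on the statement above that $gImportPageList can be used with this importer; the output file name and the page names are chosen only for illustration, and whether $gImportAPIURL is still needed when the list is given explicitly is an assumption.

```php
<?php
// Written to a separate file, so that the existing Countries.csv is not overwritten.
$gImportFileName = "Countries_test.csv";
// Assumption: kept from the example above; it may not be needed with an explicit page list.
$gImportAPIURL = "http://en.wikipedia.org/w/api.php";
// Illustrative page names only.
$gImportPageList = array( "France", "Japan" );
$gImportFields = array(
    'Name' => '_name',
    'Capital' => 'capital',
    'Currency' => 'currency',
    'URL' => '_url'
);
```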