Data Guide
Matter42 treats every upload as a project file first. When the agent parses a file, it creates a normalized dataset with a dataset_id; every later tool call should refer to that ID rather than re-reading the raw file.
Supported Uploads
| File type | Examples | What Matter42 creates |
|---|---|---|
| LabSpec hyperspectral maps | .txt Raman or PL maps with spatial coordinates and spectra | hyperspectral_map dataset with x/y grid, spectral axis, spectra cube, quality mask, and measurement metadata when available |
| Single spectra | LabSpec .txt, two-column .csv or .tsv | single_spectrum dataset with spectral axis and intensity |
| Numeric tables | .csv, .tsv, .xlsx, .xls | tabular dataset with numeric columns and optional primary x/y columns |
| Documents | .pdf, .md, .markdown, .mdx, .rst, plain .txt | document dataset for notes, paper excerpts, protocols, or metadata |
| Images | .png, .jpg, .jpeg, .tif, .tiff, .bmp, .gif | image dataset with optional modality and pixel-size metadata |
Unsupported or exotic formats, such as .mat, .npy, custom HDF5, multimodal bundles, or OCR-required images, need the guided parse_upload path. That path inspects the file outside the model context, writes a canonical HDF5 envelope, and registers it with register_dataset.
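As a rough illustration of the routing described above, the sketch below maps a file extension to the dataset kinds it could produce. The `candidate_kinds` helper and the `CANDIDATE_KINDS` table are hypothetical and only transcribe the table in this section; the kind names `tabular`, `document`, and `image` are assumed identifiers. The real parser presumably also inspects file contents, since `.txt` and `.csv` are ambiguous by extension alone.

```python
from pathlib import Path
from typing import List

# Hypothetical lookup transcribed from the "Supported Uploads" table.
# ".txt", ".csv", and ".tsv" map to more than one kind because the
# extension alone cannot tell a LabSpec map from a spectrum or a note.
CANDIDATE_KINDS = {
    ".txt": ["hyperspectral_map", "single_spectrum", "document"],
    ".csv": ["single_spectrum", "tabular"],
    ".tsv": ["single_spectrum", "tabular"],
    ".xlsx": ["tabular"],
    ".xls": ["tabular"],
}
for _ext in (".pdf", ".md", ".markdown", ".mdx", ".rst"):
    CANDIDATE_KINDS[_ext] = ["document"]
for _ext in (".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".gif"):
    CANDIDATE_KINDS[_ext] = ["image"]


def candidate_kinds(filename: str) -> List[str]:
    """Return the dataset kinds a file could become, judged by extension only.

    An empty list means the extension is not in the table, so the file
    would need the guided parse_upload path.
    """
    ext = Path(filename).suffix.lower()
    return list(CANDIDATE_KINDS.get(ext, []))
```

An empty result for something like `cube.npy` is the signal to fall back to the guided path rather than an error.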
Parser Hints
The agent can pass hints to parse_data when the file or sample needs extra context:
| Hint | Use it when |
|---|---|
| data_type="raman" or data_type="pl" | The spectral axis is ambiguous or the filename does not identify the measurement |
| boundary_buffer_um=2.0 | PFIB milling, etched holes, tears, or damaged boundaries should be excluded from interior analysis |
| description="..." | You want the project file to carry sample prep, dose, anneal, or instrument notes |
| primary_x / primary_y | A table has obvious plotting axes |
| modality="sem", "tem", "optical", or "afm" | An image should be interpreted as a known microscopy modality |
| pixel_size_um=... | Image measurements need a physical scale |
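The hints above can be pictured as a small set of optional keyword arguments. The sketch below is a minimal illustration, not the actual parse_data signature, which this guide does not document. The `ParseHints` class, its `to_kwargs` method, and the validation rules (allowed values taken from the table, non-negative buffer, positive pixel size) are all assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Allowed values copied from the hint table; the real lists may be longer.
DATA_TYPES = ("raman", "pl")
MODALITIES = ("sem", "tem", "optical", "afm")


@dataclass
class ParseHints:
    """Hypothetical container for the parser hints listed in the table."""

    data_type: Optional[str] = None
    boundary_buffer_um: Optional[float] = None
    description: Optional[str] = None
    primary_x: Optional[str] = None
    primary_y: Optional[str] = None
    modality: Optional[str] = None
    pixel_size_um: Optional[float] = None

    def to_kwargs(self) -> dict:
        """Validate the hints and return only the ones that were set."""
        if self.data_type is not None and self.data_type not in DATA_TYPES:
            raise ValueError(f"data_type must be one of {DATA_TYPES}")
        if self.modality is not None and self.modality not in MODALITIES:
            raise ValueError(f"modality must be one of {MODALITIES}")
        # Assumed sanity checks: a buffer cannot be negative, a pixel needs size.
        if self.boundary_buffer_um is not None and self.boundary_buffer_um < 0:
            raise ValueError("boundary_buffer_um must be >= 0")
        if self.pixel_size_um is not None and self.pixel_size_um <= 0:
            raise ValueError("pixel_size_um must be > 0")
        return {k: v for k, v in asdict(self).items() if v is not None}
```

For example, `ParseHints(data_type="raman", boundary_buffer_um=2.0).to_kwargs()` yields a dictionary with just those two keys, which keeps unset hints out of the call.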
For Raman density estimates, also tell the agent your instrumental linewidth if you know it:
"Use instrument_fwhm=1.6 cm^-1 for the density and classification tools."
If you do not provide it, the backend can sometimes infer an instrument contribution from the Si peak or from instrument metadata, but explicit calibration is better for quantitative work.
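To see why the instrument linewidth matters for broadening-based estimates, consider the textbook case where both the intrinsic line and the instrument response are approximated as Gaussians, so their widths add in quadrature. The sketch below applies that approximation only as an illustration. It is not a description of what the Matter42 backend does, and real Raman lines are often closer to Lorentzian or Voigt profiles. The function name is hypothetical.

```python
import math


def intrinsic_fwhm_gaussian(measured_fwhm: float, instrument_fwhm: float) -> float:
    """Remove instrument broadening assuming Gaussian line shapes.

    For two Gaussians, measured^2 = intrinsic^2 + instrument^2.
    Widths are in the same units (for example cm^-1).
    """
    if measured_fwhm < 0 or instrument_fwhm < 0:
        raise ValueError("widths must be non-negative")
    if measured_fwhm <= instrument_fwhm:
        # The peak is not resolved beyond the instrument response.
        return 0.0
    return math.sqrt(measured_fwhm**2 - instrument_fwhm**2)
```

Under this approximation, a measured width of 5.0 cm^-1 with a 3.0 cm^-1 instrument contribution leaves an intrinsic width of 4.0 cm^-1, so an uncorrected estimate would overstate the broadening.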
What Gets Extracted
For hyperspectral maps, Matter42 stores the spectra as a spatial cube and derives maps from physics-relevant windows. Raman maps focus on E2g and A1g features: intensity, peak center, FWHM, ratio, and asymmetry where measurable. PL maps focus on total intensity, peak position, FWHM, asymmetry, quenching, trion/exciton balance, and sub-gap emission.
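A minimal sketch of how intensity, peak center, and FWHM might be read from one spectral window is shown below. It assumes an ascending spectral axis and a baseline that has already been subtracted, and it uses the raw maximum plus linear interpolation at half maximum rather than a fitted line shape. The function name, the window bounds, and this method are illustrative assumptions; the guide does not state how Matter42 computes these maps.

```python
from typing import Optional, Sequence


def window_peak_metrics(axis: Sequence[float], intensity: Sequence[float],
                        lo: float, hi: float) -> dict:
    """Peak height, center, and FWHM inside the window [lo, hi]."""
    idx = [i for i, x in enumerate(axis) if lo <= x <= hi]
    if len(idx) < 3:
        raise ValueError("window contains fewer than 3 samples")
    peak_i = max(idx, key=lambda i: intensity[i])
    peak = intensity[peak_i]
    if peak <= 0:
        # No positive signal in the window, so nothing is measurable.
        return {"intensity": peak, "center": None, "fwhm": None}
    half = peak / 2.0

    def half_max_crossing(step: int) -> Optional[float]:
        """Walk away from the peak until the signal drops to half maximum."""
        i = peak_i
        while idx[0] <= i + step <= idx[-1]:
            j = i + step
            if intensity[j] <= half:
                # Interpolate between sample i (above half) and j (at or below).
                frac = (intensity[i] - half) / (intensity[i] - intensity[j])
                return axis[i] + frac * (axis[j] - axis[i])
            i = j
        return None  # the peak flank runs past the window edge

    left, right = half_max_crossing(-1), half_max_crossing(+1)
    fwhm = right - left if left is not None and right is not None else None
    return {"intensity": peak, "center": axis[peak_i], "fwhm": fwhm}
```

Returning `None` when a flank leaves the window mirrors the "where measurable" caveat above: a truncated peak gets no width rather than a misleading one.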
The parser also creates a quality mask. Downstream region-aware tools can further split pixels into:
- quality: parse-time valid pixels.
- interior: valid pixels eroded away from damaged boundaries.
- transition: boundary halo pixels.
- damaged: masked or artifact regions.
- all: every pixel, including masked pixels.
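The region split can be sketched as a simple erosion of the quality mask. The version below is an assumption-laden illustration: it converts the boundary buffer to whole pixels using a hypothetical `step_um` grid spacing, measures distance in a square (Chebyshev) neighborhood, treats every valid pixel within the buffer of a damaged pixel as transition, and does not treat the outer edge of the map as a damaged boundary. The actual region logic in Matter42 may differ.

```python
import math
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]


def split_regions(quality: List[List[bool]], buffer_um: float,
                  step_um: float) -> Dict[str, Set[Pixel]]:
    """Split a quality mask (True = valid pixel) into named pixel sets."""
    rows, cols = len(quality), len(quality[0])
    radius = math.ceil(buffer_um / step_um)  # buffer expressed in pixels
    damaged = [(r, c) for r in range(rows) for c in range(cols)
               if not quality[r][c]]
    regions: Dict[str, Set[Pixel]] = {
        "interior": set(), "transition": set(), "damaged": set(damaged),
    }
    for r in range(rows):
        for c in range(cols):
            if not quality[r][c]:
                continue
            # A valid pixel is in the halo if any damaged pixel lies
            # within `radius` pixels of it (brute force, fine for a sketch).
            near_damage = any(max(abs(r - dr), abs(c - dc)) <= radius
                              for dr, dc in damaged)
            regions["transition" if near_damage else "interior"].add((r, c))
    regions["quality"] = regions["interior"] | regions["transition"]
    regions["all"] = regions["quality"] | regions["damaged"]
    return regions
```

With this definition, interior and transition partition the quality pixels, so an analysis restricted to the interior never touches the halo around damage.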
Data Hygiene
Upload raw files when possible. Do not paste large spectra or maps into chat; attach the file so bytes stay in storage and only paths, URLs, dataset IDs, and summaries enter the conversation.
Keep matched Raman and PL maps as separate uploads. Ask the agent to parse both and pass the second dataset as an auxiliary input. The analysis tools handle grid overlap and nearest-neighbor alignment.
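Nearest-neighbor alignment between two maps can be pictured as follows. This is a brute-force sketch under stated assumptions, not the analysis tools' implementation: the function name, the `max_dist_um` cutoff, and the choice to return `None` for primary pixels with no auxiliary pixel nearby (those outside the grid overlap) are all hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def align_nearest(primary_xy: Sequence[Point], aux_xy: Sequence[Point],
                  aux_values: Sequence[float],
                  max_dist_um: float) -> List[Optional[float]]:
    """For each primary pixel, take the value of the closest auxiliary pixel.

    Pixels with no auxiliary neighbor within max_dist_um get None, which
    marks them as outside the overlap of the two grids.
    """
    aligned: List[Optional[float]] = []
    for px, py in primary_xy:
        best_i, best_d = None, float("inf")
        for i, (ax, ay) in enumerate(aux_xy):
            d = math.hypot(ax - px, ay - py)
            if d < best_d:
                best_i, best_d = i, d
        if best_i is not None and best_d <= max_dist_um:
            aligned.append(aux_values[best_i])
        else:
            aligned.append(None)
    return aligned
```

The double loop is quadratic in the number of pixels; for large maps a spatial index such as a k-d tree would be the usual replacement, with the same cutoff logic.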
For maps with sample damage, add experimental context up front: PFIB dose, boundary buffer, known etched regions, multilayer regions, laser wavelength, grating, and whether the map includes a Si calibration peak.
Good Data Questions
- "Is this Raman map high-defect or low-defect, and where are the damaged regions?"
- "Estimate vacancy percentage from E2g broadening, using only the interior region."
- "Which defect family best matches the Raman fingerprint?"
- "Do PL-quenched regions spatially overlap with Raman-broadened regions?"
- "Show me the mean spectrum and annotate peaks."
- "Compare these two annealing conditions using the same region settings."
- "For this image/table/document, summarize the usable metadata and suggest what analysis is possible."

